[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611

Draft
Nucs wants to merge 79 commits into master from nditer

Conversation

Nucs (Member) commented Apr 22, 2026

Summary

This PR ports NumPy 2.4.2's nditer machinery to NumSharp (NpyIter), builds a composable expression DSL on top (NpyExpr) with a three-tier custom-op API, wires multi-order memory layout (C/F/A/K) through the entire API surface, and replaces the matmul fallback with stride-native GEMM for all 12 dtypes (eliminates a ~100x slowdown on transposed inputs). Also lands a new Char8 1-byte dtype with 100% Python bytes parity, a trainable MNIST MLP example demonstrating fusion, and several pre-existing bug fixes surfaced by battletest.

Stats: +50,426 / -1,188 across 156 files, 64 commits.

TL;DR

  • NpyIter -- full NumPy nditer port: 32+ APIs, all iteration orders (C/F/A/K with NEGPERM), all indexing modes (MULTI_INDEX / C_INDEX / F_INDEX / RANGE), buffered casting, buffered-reduce double-loop, masking, unlimited operands and dimensions. 566/566 NumPy 2.4.2 parity scenarios pass byte-for-byte.
  • NpyExpr DSL + 3-tier custom-op API (Tier 3A raw IL / Tier 3B element-wise w/ SIMD / Tier 3C expression composition + Call(...) for arbitrary Func / MethodInfo invocation). 50+ ops, full operator overloads, structural cache key, ~5ns delegate dispatch.
  • C/F/A/K memory layout wired through np.copy, np.array, np.asarray, np.asanyarray, np.asfortranarray (new), np.ascontiguousarray (new), *_like, astype, flatten, ravel, reshape, eye, concatenate, vstack, hstack, cumsum, argsort -- plus post-hoc F-contig preservation across the ILKernel dispatchers (41 element-wise layout bugs fixed).
  • Stride-native matmul for all 12 dtypes -- np.dot(x.T, grad) MLP shape: 240 ms -> 1 ms. Removes ~165 lines of dead fallback code.
  • Char8 -- new 1-byte dtype, NumPy S1 / Python bytes(1) equivalent, full Python bytes API parity (battletested against Python oracle).
  • Trainable MNIST MLP example -- fused forward + backward + Adam, 99.89% test accuracy, 100 epochs in ~42s.
  • Bug fixes: NPTypeCode.Char.SizeOf() returned 1 (real=2), IsInf was stubbed to return null, Decimal priority was stale, argsort mishandled non-C-contig input, several NpyExpr IL emission bugs.
  • Test suite: 6,710 passing on net8.0 + net10.0, zero regressions.

Contents

NpyIter -- Full NumPy nditer Port

From-scratch C# port of NumPy 2.4.2's nditer under src/NumSharp.Core/Backends/Iterators/.
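For readers who have not used the original, this is the NumPy nditer behavior the port reproduces -- plain NumPy shown for reference, not the NumSharp API:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # C-contiguous (2, 3)

# C-order walks rows; F-order walks columns of the same buffer.
c_order = [int(x) for x in np.nditer(a, order='C')]
f_order = [int(x) for x in np.nditer(a, order='F')]
print(c_order)  # [0, 1, 2, 3, 4, 5]
print(f_order)  # [0, 3, 1, 4, 2, 5]

# multi_index tracking, one of the indexing modes the port supports.
it = np.nditer(a, flags=['multi_index'])
first = []
for x in it:
    first.append((it.multi_index, int(x)))
print(first[:2])  # [((0, 0), 0), ((0, 1), 1)]
```

The 566 parity scenarios mentioned above compare exactly these kinds of element sequences and index streams byte-for-byte.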

Files (new)

| File | Lines | Purpose |
| --- | --- | --- |
| NpyIter.cs | 3,331 | Core iterator engine |
| NpyIter.State.cs | 977 | Dynamic per-operand state allocation |
| NpyIter.Execution.cs | 657 | Iteration execution paths |
| NpyIter.Execution.Custom.cs | 155 | Tier 3A/3B/3C custom-op entry points |
| NpyIterBufferManager.cs | 637 | Buffered iteration with casting |
| NpyIterFlags.cs | 516 | Flag enums (matches NumPy semantics) |
| NpyIterCoalescing.cs | 495 | Axis reorder + coalesce |
| NpyIterCasting.cs | 483 | Casting validation + conversion |
| NpyIterKernels.cs | 263 | Path selection |
| NpyAxisIter.cs | 492 | Axis-based iterator (cumsum, var, std) |
| NpyLogicalReductionKernels.cs | 127 | Generic reduction kernel interfaces |

Capability Matrix

| Capability | Notes |
| --- | --- |
| Iteration orders | C, F, A, K (NEGPERM for negative-stride memory order) |
| Indexing modes | MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking) |
| Buffering | Type conversion, full casting rules (no / equiv / safe / same_kind / unsafe) |
| Reduction | op_axes w/ -1 reduction axes, REDUCE_OK, IsFirstVisit, buffered-reduce double-loop including bufferSize < coreSize |
| Multi-operand | Unlimited (NumPy NPY_MAXARGS=64 parity, dynamic alloc) |
| Dimensions | Unlimited (NumSharp divergence; replaces NPY_MAXDIMS=64) |
| Masking | WRITEMASKED + ARRAYMASK w/ reduction safety check |
| APIs ported | Copy, GotoIndex, GotoMultiIndex, RemoveAxis, RemoveMultiIndex, ResetBasePointers, GetMultiIndexFunc, GetInnerFixedStrideArray, GetAxisStrideArray, CreateCompatibleStrides, DebugPrint, GetIterView, IterRange, Iternext, `GetValue<T>` / `SetValue<T>`, Finished, Shape, OVERLAP_ASSUME_ELEMENTWISE, TRANSFERFLAGS, reduction-axis encoding, more |

Bugs found and fixed during port

  • Negative strides flipped regardless of order -- NumPy only flips when FORCEDORDER is unset (K-order only).
  • NO_BROADCAST flag not enforced -- was skipping operands instead of validating shape match.
  • F_INDEX returned C-order indices -- coalescing reduced NDim=1, losing axis structure. Now disables coalescing for C_INDEX / F_INDEX / MULTI_INDEX.
  • ALLOCATE with null operand threw NRE -- CalculateBroadcastShape accessed op[i].ndim before allocating.
  • op_axes reductions broken -- ApplyOpAxes was re-applying axes to already-correct strides, zeroing non-reduce strides.
  • SetupBufferedReduction produced inverted strides for non-reduce operands; only worked for 2-axis cases.
  • K-order on broadcast / non-contiguous arrays produced wrong order -- fixed by falling back to C-order when stride=0 is present.
  • Reset() desynced ranged iterators with IterStart > 0 -- now delegates to GotoIterIndex(IterStart).
  • CoalesceAxes rejected size-1 axes unless stride==0 -- size-1 axes contribute no iteration and should always absorb.
  • Buffer free-list bug -- Dispose used NativeMemory.Free for AlignedAlloc'd buffers (memory corruption).
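The negative-stride fix above tracks a NumPy rule that is easy to observe directly: K-order may walk a negative-stride view forward in memory, while C-order must respect the logical element order. A small NumPy reference check:

```python
import numpy as np

r = np.arange(6)[::-1]          # negative-stride view: [5, 4, 3, 2, 1, 0]

# K-order flips the all-negative axis and iterates in memory order...
k = [int(x) for x in np.nditer(r, order='K')]
# ...while C-order must yield the logical (reversed) sequence.
c = [int(x) for x in np.nditer(r, order='C')]
print(k)  # [0, 1, 2, 3, 4, 5]
print(c)  # [5, 4, 3, 2, 1, 0]
```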

NpyExpr DSL + 3-tier Custom-Op API

User-extensible kernel layer built on NpyIter.

Tiers

  • Tier 3A -- ExecuteRawIL(body, key, aux): emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* byteStrides, long count, void*). Full control.
  • Tier 3B -- ExecuteElementWise(scalar, vector, ...): per-element IL + 4x-unrolled SIMD shell + 1-vec remainder + scalar tail + scalar-strided fallback. SIMD when all operand dtypes match and are SIMD-capable.
  • Tier 3C -- ExecuteExpression(expr, inputTypes, outputType): compose NpyExpr trees via operator syntax, Compile() emits IL automatically.

NpyExpr Node Catalog

| Category | Ops |
| --- | --- |
| Binary arithmetic | Add Subtract Multiply Divide Mod Power FloorDivide ATan2 |
| Binary bitwise | BitwiseAnd BitwiseOr BitwiseXor |
| Unary arithmetic | Negate Abs Sign Sqrt Cbrt Square Reciprocal |
| Unary exp/log | Exp Exp2 Expm1 Log Log2 Log10 Log1p |
| Unary trig | Sin Cos Tan Sinh Cosh Tanh ASin ACos ATan Deg2Rad Rad2Deg |
| Unary rounding | Floor Ceil Round Truncate |
| Unary predicates | BitwiseNot LogicalNot IsNaN IsFinite IsInf |
| Comparisons | Equal NotEqual Less LessEqual Greater GreaterEqual (return 0/1 at output dtype) |
| Combinators | Min Max Clamp Where(cond, a, b) |
| Operators | `+` `-` `*` `/` `%` `&` `\|` `^` `~` `!` unary `-` |

Call(...) escape hatch (commit 8da3e693)

Invoke any Func<...>, Delegate, or MethodInfo per element -- three dispatch paths chosen at construction:

| Condition | Emitted IL |
| --- | --- |
| Static method, no target | `call <methodinfo>` (zero-indirection, JIT-inlinable) |
| Instance MethodInfo + target | `ldc.i4 id` -> `LookupTarget` -> `castclass T` -> `callvirt <mi>` |
| Any other Delegate | `ldc.i4 id` -> `LookupDelegate` -> `castclass Func<..>` -> `callvirt Invoke` |

Auto-conversion at the call boundary (input/output via EmitConvertTo). Process-wide DelegateSlots registry, ~5ns lookup. Cache key includes MetadataToken + ModuleVersionId to disambiguate dynamic assemblies.
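The closest plain-NumPy analog to Call(...) is wrapping an arbitrary callable as a per-element ufunc. This sketch only illustrates the semantics being matched; note that np.frompyfunc yields object dtype, whereas Call(...) converts to the requested output dtype at the call boundary:

```python
import numpy as np

# Wrap an arbitrary Python callable as a 1-in / 1-out per-element ufunc.
double_plus_one = np.frompyfunc(lambda v: 2 * v + 1, 1, 1)

out = double_plus_one(np.arange(4))
print(out)          # elementwise results, object dtype
print(out.dtype)    # object -- the boundary conversion is manual here
```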

Bugs caught during DSL battletest

  • IsNaN / IsFinite / IsInf silently wrote I4 0/1 into double slots (denormals instead of 1.0). Fix: UnaryNode inserts trailing EmitConvertTo(Int32, outType).
  • LogicalNot broken for Int64 / Single / Double / Decimal -- Ldc_I4_0+Ceq only valid for I4-sized operands. Fix: route through EmitComparisonOperation(Equal, outType) with properly-typed zero literal.
  • WhereNode prelude unfinished (threw at compile). Rewrote.
  • MinMaxNode did not propagate NaN -- rerouted through Math.Min / Math.Max (matches np.minimum / np.maximum, not fmin / fmax).
  • Vector256.Round / Truncate are .NET 9+ only -- excluded from SIMD path; scalar path works on net8.
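The MinMaxNode fix hinges on a NumPy distinction worth spelling out: np.minimum / np.maximum propagate NaN, while np.fmin / np.fmax ignore it. A quick reference check of the behavior the node now matches:

```python
import numpy as np

print(np.minimum(np.nan, 1.0))   # nan  (NaN propagates)
print(np.maximum(np.nan, 2.0))   # nan
print(np.fmin(np.nan, 1.0))      # 1.0  (NaN ignored)
print(np.fmax(np.nan, 2.0))      # 2.0
```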

Multi-Order Memory Layout (C/F/A/K)

Shape changes (src/NumSharp.Core/View/Shape.cs, +171 lines)

  • New IsFContiguous (O(1) flag check via ArrayFlags.F_CONTIGUOUS).
  • New ComputeFContiguousStrides helper.
  • New Shape(long[] dims, char order) ctor for explicit physical-order construction.
  • Aligned _UpdateContiguousFlags with NumPy -- single-pass (isC, isF) tuple, fewer call sites.
  • Fixed Shape.Order -- was hardcoded to 'C', now derives from contiguity flags.
  • Fixed empty-array semantics -- any dim==0 is both C- and F-contig per NumPy, no BROADCASTED flag.
  • OrderResolver.cs (new, 75 lines) -- centralizes C/F/A/K -> C/F mapping.
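The contiguity-flag semantics these Shape changes adopt are the NumPy ones, which can be checked directly:

```python
import numpy as np

c = np.ones((3, 4))                  # C-contiguous by default
f = np.asfortranarray(c)             # same values, column-major layout

print(c.flags['C_CONTIGUOUS'], c.flags['F_CONTIGUOUS'])  # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True

# A transposed C array reports F-contiguity (cf. the Shape.Order fix).
print(c.T.flags['F_CONTIGUOUS'])     # True

# Any dim == 0 makes the array both C- and F-contiguous
# (cf. the empty-array semantics fix).
e = np.empty((0, 5))
print(e.flags['C_CONTIGUOUS'], e.flags['F_CONTIGUOUS'])  # True True
```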

API surface wiring

| API | Change |
| --- | --- |
| np.copy, NDArray.copy(order) | New overload, default 'K' (was 'C') |
| np.array(Array, ..., order) | F-order materialization via copy('F') |
| np.asarray, np.asanyarray | New overloads accepting Type? + order |
| np.asfortranarray, np.ascontiguousarray | NEW thin wrappers |
| np.empty_like / zeros_like / ones_like / full_like | New order overload, default 'K' |
| astype(Type, copy, order) | New overload, default 'K' |
| flatten(order), ravel(order) | F-order via copy('F') reinterpret |
| reshape(Shape, char order) | New overload; F-order column-major fill |
| np.eye(..., order) | New overload |
| np.concatenate, vstack, hstack | Preserve F-contig when all inputs F-contig |
| np.cumsum(axis) | Preserve F-contig via post-hoc copy('F') |
| NDArray.argsort | Copies non-C-contig input to C-contig first |

Post-hoc F-contig preservation across ILKernel dispatch (Group F, 41 bugs fixed)

Refactoring 27 partial files (~21K lines) of IL emitters to accept arbitrary output strides was rejected as too invasive. Instead, the central dispatchers (ExecuteBinaryOp, ExecuteUnaryOp, ExecuteComparisonOp) call ShouldProduceFContigOutput(operands, resultShape) after the kernel runs and, when every non-scalar operand is strictly F-contiguous, re-lay the result via a cheap .copy('F'). This matches NumPy's F+F->F, C+C->C, F+C->C, F*scalar->F, F+FCol->F rules.

np.modf, np.clip, np.negative, np.maximum / minimum updated individually (do not route through the central dispatchers).
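The layout-propagation rules being matched can be verified against NumPy itself:

```python
import numpy as np

F = np.asfortranarray(np.ones((3, 4)))
C = np.ones((3, 4))

print((F + F).flags['F_CONTIGUOUS'])    # True:  F + F      -> F
print((F + C).flags['C_CONTIGUOUS'])    # True:  F + C      -> C
print((F * 2.0).flags['F_CONTIGUOUS'])  # True:  F * scalar -> F
print((C + C).flags['C_CONTIGUOUS'])    # True:  C + C      -> C
```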

TDD coverage

51 sections of OrderSupport.OpenBugs.Tests.cs (3,005 lines), each driven by side-by-side Python / NumPy 2.4.2 output. Remaining [OpenBugs] are minimal API gaps (np.tile, np.flip, np.where, np.sort).

Stride-Native MatMul

np.dot / np.matmul previously fell into a ~100x slower fallback for any non-contiguous operand (transposed view, slice). This PR ships stride-native paths for all 12 dtypes.

New files

  • SimdMatMul.Strided.cs (338 lines) -- generalized 8x16 Vector256 FMA micro-kernel for float. New packers PackAPanelsStrided / PackBPanelsStrided with fast paths for transposed-contig and row-contig.
  • SimdMatMul.Double.cs (108 lines) -- stride-aware IKJ Vector256 kernel (4 FMAs).
  • Default.MatMul.Strided.cs (357 lines) -- MatMulStridedSame<T> where T : INumber<T> (JIT specializes per type with auto-vectorization), plus MatMulStridedBool (NumPy's OR-of-ANDs short-circuit), MatMulStridedMixed<TResult> (typed pointer reads via ReadAsDouble, no boxing).

Dead code removed (~165 lines)

MatMulGeneric, MatMulCore<TResult>, MatMulSameType<T>, four MatMulContiguous overloads (float/double/int/long), MatMulMixedType<TResult>.

Performance (MLP backward shapes)

| Op | Before | After |
| --- | --- | --- |
| dot(x.T, grad) 784x64 @ 64x128 | 240 ms | 1 ms |
| dot(grad, W.T) 64x128 @ 128x784 | 226 ms | 1 ms |
| Lt(400,500) @ L(500,400), blocked | 12 ms | 8 ms (skips copy) |
| int32 (150,200) @ (200,150), stride-native | -- | 10 ms (was 11 ms copy+matmul) |

The MLP example's .copy() workaround on transposed views is now removed -- kernel absorbs strides directly.
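The correctness property a stride-native kernel must satisfy is that reading the transposed view in place gives the same result as copying it contiguous first. In NumPy terms (shapes borrowed from the MLP backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 784)).astype(np.float32)
g = rng.standard_normal((64, 128)).astype(np.float32)

# x.T is a non-contiguous view; the stride-native path reads it in place.
direct = np.dot(x.T, g)
copied = np.dot(np.ascontiguousarray(x.T), g)
print(direct.shape)                  # (784, 128)
print(np.allclose(direct, copied))   # True
```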

Test coverage

MatMulStridedTests (28 tests): all 4 BLAS transpose cases (NN/NT/TN/TT) x float/double x simple/blocked, per-dtype stride-native (byte/int16/uint16/int32/uint32/int64/uint64/char/decimal/bool), sliced views with Shape.offset > 0, mixed-type, MLP-shape regression tests.

Char8 Dtype

New NumSharp.Char8 -- [StructLayout(Sequential, Size=1)] readonly struct, NumPy dtype('S1') / Python bytes(1) equivalent.

Files (new, ~1,450 lines)

| File | Lines | Purpose |
| --- | --- | --- |
| Char8.cs | 725 | Core: Latin1CharInfo, classification, case, operators, IConvertible, IFormattable |
| Char8.Operators.cs | 169 | Mixed-type ops (char/byte/int), Span reinterpret |
| Char8.Conversions.cs | 261 | Instance/factory To* / From* for all 12 dtypes, saturating, truncating |
| Char8.Spans.cs | 201 | Search / equality / UTF-8 classification on `ReadOnlySpan<Char8>` |
| Char8.PyBytes.cs | 531 | Python bytes array methods (Strip/Split/Replace/Center/...) |
| Converts.Char8.cs | 317 | Converts integration parallel to Converts.Native.cs |

Adapted from .NET System.Char (src/dotnet/, fetched into a reference library; Latin1CharInfo[256] table copied verbatim).

Python parity (caught by oracle diff)

Three parity bugs were surfaced and fixed by a 250-line Python bytes oracle comparison:

  1. Count with empty pattern -- Python returns len(s) + 1, was 0.
  2. Center asymmetric padding -- CPython formula left = pad/2 + (pad & width & 1).
  3. SplitLines too permissive -- bytes.splitlines() only recognizes \n / \r / \r\n (not \v / \f / \x1c-1e).
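The three CPython behaviors above can be reproduced with plain Python, which is exactly what the oracle diff compares against:

```python
# 1. count with an empty pattern returns len(s) + 1.
print(b'abc'.count(b''))      # 4

# 2. center biases the extra pad byte per CPython's formula
#    left = marg // 2 + (marg & width & 1).
print(b'ab'.center(5))        # b'  ab '

# 3. bytes.splitlines only splits on \n, \r, \r\n -- unlike str,
#    which also treats \v, \f, \x1c-\x1e as line breaks.
print(b'a\vb'.splitlines())   # [b'a\x0bb']
print('a\vb'.splitlines())    # ['a', 'b']
```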

Status

Standalone for now -- not yet wired into NPTypeCode enum (would touch ~50 switch statements across DefaultEngine / ILKernelGenerator / NpyIter / casting / Converts; deferred to a separate PR).

MNIST MLP Example

examples/NeuralNetwork.NumSharp/MnistMlp/ -- runnable end-to-end classifier.

  • Architecture: 784 -> 128 (ReLU) -> 10, float32, He-init, Adam.
  • Forward fusion: bias + ReLU collapses into one NpyIter per layer (NpyExpr.Max(Input(0) + Input(1), 0)).
  • Backward fusion: gradOut * (y > 0) ReLU mask fused.
  • Loss: SoftmaxCrossEntropy (combined, max-subtracted, numerically stable).
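A NumPy sketch of the max-subtracted combined loss (illustrative only; function name and shapes are mine, not the NumSharp code):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Numerically stable combined softmax + cross-entropy.

    logits: (N, K) float array; labels: (N,) integer class ids.
    """
    # Subtracting the row max leaves softmax unchanged but keeps exp() bounded.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[10.0, 0.0, 0.0]])
print(softmax_cross_entropy(logits, np.array([0])))  # near zero (confident, correct)
print(softmax_cross_entropy(logits, np.array([1])))  # large (confident, wrong)
```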

Results (6000 train / 1000 test, batch 128, Adam lr=1e-3):

| Phase | Total | Final test acc |
| --- | --- | --- |
| Pre-stride-native dot | 100.7 s (5 epochs) | 100% |
| Post-copy() workaround | 3.2 s (5 epochs) | 100% |
| Final 100-epoch demo (post stride-native) | ~42 s | 99.89% |

NN scaffolding fixes outside MnistMlp/ -- completed every stubbed or broken class:

  • Softmax -- Forward was empty; Backward used the sigmoid derivative.
  • Sigmoid.Forward -- was empty.
  • CategoricalCrossentropy -- no clipping, wrong backward formula.
  • BinaryCrossEntropy -- did not divide by N.
  • Accuracy / BinaryAccuacy -- returned scalar/null.
  • FullyConnected -- no bias, skewed init.
  • NeuralNet.Train -- used 2-index integer selection where slicing was intended.
  • Adam -- ms / vs init was commented out.
  • Added an SGD optimizer.

Each verified against analytical references with finite-difference grad checks (29/29 pass).

Bug Fixes

  • NPTypeCode.Char.SizeOf() returned 1, real is 2 (UTF-16). Affected NpyIter.SetOpDType (ElementSizes[op] x stride in 8 places), 8 cast sites, np.frombuffer, np.dtype(char).itemsize, axis reductions. Survived without test failures because NumPy has no native char dtype and ASCII reads accidentally land on the right byte.
  • GetPriority(Decimal) = 5*10*32 was stale after the prior Decimal SizeOf fix -- corrected to 5*10*16=800. No behavioral change (relative ordering preserved).
  • DefaultEngine.IsInf was stubbed to return null (NRE on any IsInf call). Now wired through ExecuteUnaryOp with the existing IL kernel.
  • NDArray.Copy.cs share-by-reference bug -- new Shape(this.Shape.dimensions, 'F') aliased the source int[]; cloned now.
  • NDArray.argsort -- copies non-C-contig input to C-contig first (matches NumPy's invariant that argsort always returns C-contig).
  • flatten allocation regression -- F-order path was double-allocating (copy('F') + Array.Clone()). Fixed: reinterpret directly.

Behavioral Changes

| Area | Change | Migration |
| --- | --- | --- |
| np.copy default order | 'C' -> 'K' | No change for C-contig input (K preserves layout) |
| MaxOperands=8 removed | Now unlimited (dynamic alloc) | Drop-in |
| MaxDims=64 removed | Now unlimited (~300K dims, stackalloc-bound) | Drop-in |
| F-order iteration | Now produces [0,3,1,4,2,5] for 2x3 C-contig (was [0,1,2,3,4,5]) | Matches NumPy |
| K-order on broadcast / non-contig | Falls back to C-order (was stride-sort, broken with stride=0) | Matches NumPy |
| Negative strides | Only flipped for K-order | Matches NumPy FORCEDORDER rule |
| Empty arrays | IsContiguous and IsFContiguous both true (was both false) | Matches NumPy |
| Shape.Order | Now derives from contiguity flags (transpose of C reports 'F') | Was hardcoded to 'C' |
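The np.copy default change is the one most likely to surprise; the new 'K' default matches what NumPy does, which a quick check makes concrete:

```python
import numpy as np

f = np.asfortranarray(np.ones((3, 4)))

# order='K' (the new NumSharp default) preserves the source layout;
# order='C' (the old default) always forces row-major.
print(np.copy(f, order='K').flags['F_CONTIGUOUS'])  # True
print(np.copy(f, order='C').flags['F_CONTIGUOUS'])  # False
print(np.copy(f, order='C').flags['C_CONTIGUOUS'])  # True
```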

Documentation

  • docs/website-src/docs/NDIter.md (1,934 lines) -- comprehensive NpyIter reference: 7-technique quick reference, decision tree, full Tier C node catalog with NumPy-equivalent column, type discipline, SIMD coverage rules, caching/auto-keys, validation, gotchas, debugging, memory model + lifetime, 19 worked examples (Swish, GELU, Heaviside, Horner polynomial, fused sigmoid, NaN replacement, etc.).
  • docs/website-src/docs/ndarray.md (537 lines) -- NDArray reference page.
  • docs/NPYITER_AUDIT.md, NPYITER_DEEP_AUDIT.md, NPYITER_NUMPY_DIFFERENCES.md, NPYITER_BUFFERED_REDUCE_ANALYSIS.md -- implementation audit reports.
  • Tier renamed A/B/C -> 3A/3B/3C to make the layer-3 sub-tier relationship explicit (100 references across 6 files).

Test Plan

  • Full suite: 6,710 tests pass on net8.0 + net10.0 with CI filter TestCategory!=OpenBugs&TestCategory!=HighMemory. Zero regressions.
  • NumPy 2.4.2 nditer parity: 566/566 scenarios pass byte-for-byte (491 random fuzz seed=42 + 75 structured). Element sequences, stride arrays, multi-indices, reduction outputs all match.
  • NpyExpr + custom-op tests: +264 tests (NpyIterCustomOpTests, NpyIterCustomOpEdgeCaseTests, NpyExprExtensiveTests, NpyExprCallTests).
  • nditer API parity tests: +94 across 10 new files (NpyIterAxisStrideArrayTests, NpyIterCreateCompatibleStridesTests, NpyIterDebugPrintTests, NpyIterGetMultiIndexFuncTests, NpyIterInnerFixedStrideArrayTests, NpyIterOverlapAssumeElementwiseTests, NpyIterReductionAxisEncodingTests, NpyIterResetBasePointersTests, NpyIterTransferFlagsTests, NpyIterWriteMaskedTests).
  • MatMul stride-native: +28 MatMulStridedTests covering all 4 BLAS transpose cases, per-dtype stride-native, sliced views, mixed-type, MLP shapes.
  • Char8: +69 cases (source-generated discovery), 250-line Python bytes oracle diff (identical), 270+ C# edge assertions.
  • Order support TDD: +150 tests across 51 sections in OrderSupport.OpenBugs.Tests.cs.
  • Shape order: +24 tests in Shape.Order.Tests.cs.
  • MLP example: 100-epoch run reaches 99.89% test accuracy; per-epoch breakdown: forward 20.3% / loss 1.5% / backward 52.5% / optimizer 25.8%; bit-exact fused-vs-naive correctness (max |diff| = 0); kernel cache delta = 6 (compiled once, cache-hit thereafter); delegate-slot count = 0.

Known Issues / Out of Scope

  • Char8 not wired into NPTypeCode -- would touch ~50 switch statements; separate PR.
  • np.tile, np.flip, np.where, np.sort still missing (4 [OpenBugs] markers remain after this PR).
  • One pre-existing SetIndicesND assertion on multi-row fancy-write with scalar/matching-array RHS -- investigation in commit 47a74aa9 showed it is a pre-existing indexing bug, not F-order specific. Reproductions added under [OpenBugs].

Migration / Compatibility

Most changes are additive. The behavioral changes table above lists the user-visible deltas -- all align NumSharp closer to NumPy 2.4.2. No deprecated APIs, no removed public surface. The MaxOperands=8 and MaxDims=64 constants are removed but were internal.

Nucs added 30 commits April 22, 2026 23:27
Implements fixes detailed in docs/NPYITER_FIXES_REQUIRED.md to improve
NumPy compatibility of the NpyIter implementation.

Fix #1: Coalescing Always Runs
- Changed NpyIterRef.Initialize() to always coalesce axes after
  construction unless MULTI_INDEX flag is set
- Matches NumPy's nditer_constr.c line 395-396 behavior

Fix #2: Inner Stride Cache
- Added InnerStrides[MaxOperands] array to NpyIterState
- Added UpdateInnerStrides() method to gather inner strides
- GetInnerStrideArray() now returns contiguous array matching
  NumPy's NpyIter_GetInnerStrideArray() format

Fix #3: op_axes Parameter Implementation
- Added ApplyOpAxes() method to support axis remapping
- Supports -1 entries for broadcast/reduction axes
- Enables reduction operations via custom axis mapping

Fix #4: Multi-Index Support
- Added GetMultiIndex(Span<long>) for coordinate retrieval
- Added GotoMultiIndex(ReadOnlySpan<long>) for coordinate jumping
- Added HasMultiIndex property
- HASMULTIINDEX flag tracked during construction

Fix #5: Ranged Iteration
- Added ResetToIterIndexRange(start, end) for parallel chunking
- Added IterStart, IterEnd, and IsRanged properties
- RANGE flag tracks ranged iteration mode

Fix #6: Buffer Copy Type Dispatch
- Added non-generic CopyToBuffer/CopyFromBuffer overloads
- Runtime dtype dispatch for all 12 NumSharp types
- Enables dtype-agnostic iteration code

Fix #7: Flag Bit Positions Documented
- Added documentation explaining NumSharp's flag bit layout
- Legacy compatibility flags use bits 0-7
- NumPy-equivalent flags use bits 8-15
- Semantic meaning matches NumPy, positions differ

Fix #8: MaxDims Increased to 64
- Changed MaxDims from 32 to 64 to match NPY_MAXDIMS
- Supports high-dimensional array iteration

Test coverage:
- 13 new tests for coalescing, multi-index, ranged iteration,
  inner strides, and MaxDims validation
- All 5666 non-OpenBugs tests pass

Note: Full axis reordering before coalescing (for complete 1D
coalescing of contiguous arrays) not yet implemented. Current
implementation coalesces adjacent compatible axes only.
BREAKING: Replaces NumPy's fixed NPY_MAXDIMS=64 limit with unlimited
dimension support via dynamic array allocation.

NumSharp Divergence Rationale:
- NumSharp's Shape uses regular managed arrays (int[] dimensions, int[] strides)
- Practical limit is ~300,000 dimensions (stackalloc limit)
- This matches NumSharp's core design philosophy of unlimited dimensions
- Memory scales with actual dimensions, not worst-case fixed allocation

Implementation:
- Removed MaxDims constant (was 64)
- Added StridesNDim field to track stride array allocation size
- Dimension-dependent arrays (Shape, Coords, Perm, Strides) are now
  dynamically allocated pointers instead of fixed arrays
- Added AllocateDimArrays(ndim, nop) for allocation
- Added FreeDimArrays() for cleanup
- All arrays allocated in single contiguous block for cache efficiency

Per-operand arrays still use fixed MaxOperands=8 limit (reasonable for
multi-operand operations).

Memory Management:
- NpyIterRef.Dispose() calls FreeDimArrays()
- Static NpyIter methods use try/finally for cleanup
- Exception handling properly frees arrays on construction failure

Updated files:
- NpyIter.State.cs: Dynamic allocation with detailed comments
- NpyIter.cs: Call allocation in Initialize(), free in Dispose()
- NpyIterCoalescing.cs: Use StridesNDim instead of MaxDims
- NpyIterBufferManager.cs: Use StridesNDim for stride indexing
- NpyIterKernels.cs: Use StridesNDim in path selection

Tests:
- Removed MaxDims_Is64 test
- Added UnlimitedDimensions_HighDimensionalArray (20D array test)
- Added UnlimitedDimensions_MaxOperands (verifies MaxOperands=8)
- All 5667 tests pass
NpyAxisIter Implementation:
- NpyAxisIter.cs: Axis-based iterator for cumsum, cumprod, var, std
- NpyAxisIter.State.cs: State struct for axis iteration
- NpyLogicalReductionKernels.cs: Generic numeric reduction kernel interfaces

Logical Reduction Refactoring:
- Default.LogicalReduction.cs: Unified logical reduction implementation
- Default.All.cs/Default.Any.cs: Simplified to use new infrastructure
- np.all.cs/np.any.cs: Cleaned up API layer

Cumulative Operation Fixes:
- Default.Reduction.CumAdd.cs: Added contiguity checks for IL kernel path
- Default.Reduction.CumMul.cs: Added contiguity checks for IL kernel path
- Falls back to NpyAxisIter for sliced/reversed/broadcast views

Variance/Std Fixes:
- Default.Reduction.Std.cs: Updated reduction implementation
- Default.Reduction.Var.cs: Updated reduction implementation

Copy Kernel Infrastructure:
- CopyKernel.cs: Copy kernel key and execution path definitions
- ILKernelGenerator.Copy.cs: IL-generated copy kernels
- np.copyto.cs: Updated to use new copy infrastructure

Utilities:
- InfoOf.cs: Added GetSize helper for dtype size lookup
- MultiIterator.cs: Minor updates
- UnmanagedStorage.Cloning.cs: Minor updates

Documentation:
- docs/NPYITER_FIXES_REQUIRED.md: NpyIter implementation requirements
- docs/NPYITER_PARITY_ANALYSIS.md: NumPy parity analysis
- docs/DEFAULTENGINE_ILKERNEL_PLAYBOOK.md: IL kernel guidelines
- docs/DEFAULTENGINE_ILKERNEL_RULEBOOK.md: IL kernel rules
- docs/plans/NDITER.md: NDIter implementation plan

Tests:
- NpyIterBattleTests.cs: Iterator battle tests
- NpyIterReductionBattleTests.cs: Reduction battle tests
- NpyIterScanBattleTests.cs: Scan operation battle tests
- np.logical_reduction.iterator.tests.cs: Logical reduction tests
- np.copyto.NpyIter.Test.cs: Copy operation tests
- Updated np.all.Test.cs, np.any.Test.cs
- NpyIterRefTests.cs: Fix ref struct lambda capture issue
- np.logical_reduction.iterator.tests.cs: Add [TestClass], replace [Test] with [TestMethod]

All 5742 tests pass (excluding OpenBugs category)
…D collapse

NumPy reorders axes by stride magnitude BEFORE coalescing, which allows
contiguous arrays to fully coalesce to ndim=1. This commit implements
the same behavior in NumSharp.

Problem:
For a C-contiguous (2,3,4) array with strides [12,4,1], the coalescing
formula stride[i]*shape[i]==stride[i+1] fails without reordering:
- (0,1): 12*2=24 != 4 → cannot coalesce

Solution:
Added ReorderAxesForCoalescing() that sorts axes by minimum stride:
- After reorder: shapes [4,3,2], strides [1,4,12]
- (0,1): 1*4=4 == 4 ✓ → coalesce to [12,2], strides [1,12]
- (0,1): 1*12=12 == 12 ✓ → coalesce to [24], strides [1]

Changes:
- NpyIterCoalescing.cs: Added ReorderAxesForCoalescing(state, order)
  - Uses insertion sort (stable, good for nearly-sorted data)
  - Respects NPY_ORDER parameter (ascending for C-order, descending for F-order)
  - Marked old ReorderAxes() as obsolete

- NpyIter.cs: Initialize() now calls ReorderAxesForCoalescing() before
  CoalesceAxes() when multi-index is not tracked

- NpyIterRefTests.cs: Updated tests to expect ndim=1 for contiguous arrays

Test results: 5742 passed, 11 skipped, 0 failed
Add missing NumPy nditer features to NpyIterRef:
- RemoveMultiIndex(): Enable coalescing after construction by calling
  ReorderAxesForCoalescing + CoalesceAxes, matching NumPy behavior
- Finished property: Check if iteration is complete
- Shape property: Get current iterator shape after coalescing
- IterRange property: Get (Start, End) tuple
- Iternext(): Advance and return bool for remaining elements
- GetValue<T>/SetValue<T>: Type-safe value access at current position
- GetDataPtr(): Raw pointer access to current operand data

Fix RemoveAxis() to recalculate IterSize after dimension removal.

Add 45 comprehensive NumPy parity tests derived from actual NumPy 2.4.2
output, covering:
- Coalescing behavior (contiguous, transposed, sliced, scalar, empty)
- C_INDEX and F_INDEX tracking (2D and 3D arrays)
- RemoveMultiIndex and RemoveAxis
- Finished property and Iternext pattern
- Shape property changes after coalescing
- Ranged iteration
- Value access (GetValue, SetValue)
- Multi-operand iteration
- Sliced and broadcast arrays
- Edge cases (empty, scalar)

Document known divergences:
- F-order with MULTI_INDEX: skips axis reordering to preserve indices
- K-order on F-contiguous with MULTI_INDEX: same limitation

Create docs/NPYITER_NUMPY_DIFFERENCES.md with complete analysis of
NumPy nditer vs NumSharp NpyIter implementation differences.

Test results: 196 NpyIter tests passing, 5796 total tests passing
…INDEX

Add complete NumPy parity for iteration order when MULTI_INDEX is set:

F-order (NPY_FORTRANORDER):
- First axis changes fastest (column-major iteration)
- Reverses axis order so original axis 0 is innermost
- Uses Perm array to map internal coords to original axis order

K-order (NPY_KEEPORDER):
- Follows memory layout (smallest stride innermost)
- Sorts axes by stride, largest first when MULTI_INDEX is set
- Enables memory-order iteration on transposed/F-contiguous arrays

Key implementation changes:
- Initialize Perm to identity [0,1,2,...] in AllocateDimArrays
- Add forCoalescing parameter to ReorderAxesForCoalescing:
  - true (no MULTI_INDEX): ascending sort for coalescing formula
  - false (with MULTI_INDEX): descending sort for iteration order
- GetMultiIndex: Apply inverse permutation (outCoords[Perm[d]] = Coords[d])
- GotoMultiIndex: Apply permutation (Coords[d] = coords[Perm[d]])
- Shape property: Return shape in original axis order when MULTI_INDEX set

Test results:
- F-order: values 0,3,1,4,2,5 on (2,3) array (matches NumPy)
- K-order on transposed: values 0,1,2,3,4,5 following memory (matches NumPy)
- 196 NpyIter tests passing, 5796 total tests passing
Add GotoIndex() method that jumps to a specific flat index position
based on C_INDEX or F_INDEX flag. This enables random access by flat
array index while iterating.

Implementation details:
- Converts flat index to original coordinates using appropriate formula:
  - C-order: decompose using row-major index strides
  - F-order: decompose using column-major index strides
- Uses Perm array to map original coords to internal coords
- Updates data pointers correctly after position change

Fix ComputeFlatIndex to use original coordinate order:
- Build original coords/shape from internal using Perm array
- Compute C or F index in original array's coordinate system
- Fixes c_index tracking during F-order iteration

Fix Advance() to compute FlatIndex AFTER coords are updated:
- FlatIndex was being computed before coord increment (off by one)
- Now correctly computes after coordinate update
- Fast path (identity perm + C_INDEX) still uses simple increment

Add comprehensive tests:
- GotoIndex with C_INDEX (2D and 3D arrays)
- GotoIndex with F_INDEX
- C_INDEX tracking during F-order iteration

Test results: 200 NpyIter tests passing, 5800 total tests passing
Add Copy() method that creates an independent copy of the iterator
at its current position, matching NumPy's NpyIter_Copy behavior:
- Allocates new NpyIterState on heap
- Copies all fixed-size fields
- Allocates new dimension arrays (Shape, Coords, Perm, Strides)
- Returns new NpyIterRef that owns the copied state
- Advancing/resetting the copy does not affect the original

Add comprehensive tests:
- Copy preserves position
- Copy is independent (advancing original doesn't affect copy)
- Copy preserves flags (MULTI_INDEX, C_INDEX)
- Resetting copy doesn't affect original

Test results: 203 NpyIter tests passing, 5803 total tests passing
…eration

NumPy's nditer flips axes with all-negative strides for cache-efficient
memory-order iteration while tracking flipped coordinates via negative
Perm entries. This implementation adds full NumPy parity.

Key changes:
- FlipNegativeStrides(): Negate all-negative axes, adjust base pointers,
  mark with negative Perm entries, set NEGPERM flag
- GetMultiIndex/GotoMultiIndex: Handle NEGPERM by reversing coords for
  flipped axes (shape - coord - 1)
- GotoIndex: Handle NEGPERM in flat index to multi-index conversion
- ComputeFlatIndex: Handle NEGPERM for correct C/F index computation
- InitializeFlatIndex: Compute initial FlatIndex after axis setup
- HasNegPerm/HasIdentPerm: New properties for perm state inspection
- DONT_NEGATE_STRIDES: Flag support to preserve view logical order

13 new NumPy parity tests covering:
- 1D/2D/3D reversed arrays
- Row/col/both reversed 2D arrays
- GotoIndex/GotoMultiIndex with flipped axes
- Mixed operands (one positive stride prevents flip)
- DONT_NEGATE_STRIDES flag behavior
- Iteration without MULTI_INDEX flag

All 214 NpyIter tests pass, 5814 total tests pass.
GetIterView returns an NDArray view of the i-th operand with the
iterator's internal axes ordering. A C-order iteration of this view
is equivalent to the iterator's iteration order.

Key features:
- Returns view with iterator's internal shape and strides (after
  coalescing/reordering)
- For coalesced arrays: returns lower-dimensional view (e.g., 3D->1D)
- For sliced/transposed arrays: reflects internal optimization
- Throws InvalidOperationException when buffering is enabled
- Validates operand index bounds

8 new NumPy parity tests covering:
- Contiguous array (coalesced to 1D)
- MULTI_INDEX (preserves original shape)
- Transposed array with K-order
- Sliced arrays with non-contiguous strides
- Multiple operands
- Buffered iterator exception
- Invalid operand index exception
- Reversed array with flipped strides

All 222 NpyIter tests pass, 5822 total tests pass.
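The Python-level analogue is nditer's `itviews` property; a sketch of the two shape-relevant cases (assumes default flags, no buffering):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous

# The default iterator coalesces a contiguous operand down to 1-D:
print(np.nditer(a).itviews[0].shape)                          # (6,)
# multi_index disables coalescing, so the view keeps the shape:
print(np.nditer(a, flags=['multi_index']).itviews[0].shape)   # (2, 3)
```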
…ered iteration

NumPy's nditer supports automatic type conversion when iterating arrays
with different dtypes. This implementation adds full NumPy parity for
casting during buffered iteration.

Key changes:

- NpyIterCasting.cs (new): Complete casting validation and conversion
  - CanCast(): Validates casting rules (no_casting, equiv, safe, same_kind, unsafe)
  - IsSafeCast(): Safe casts like smaller int -> larger int, any int -> float64
  - IsSameKindCast(): Both integers or both floats
  - ValidateCasts(): Throws InvalidCastException for invalid casts
  - FindCommonDtype(): For COMMON_DTYPE flag
  - ConvertValue(): Single value conversion via double intermediate
  - CopyWithCast(): Strided array copy with type conversion

- NpyIter.State.cs:
  - OpSrcDTypes[]: Track source array dtypes for casting
  - SrcElementSizes[]: Source element sizes for stride calculation
  - GetOpSrcDType/SetOpSrcDType accessors
  - NeedsCast(op): Check if operand requires type conversion

- NpyIterBufferManager.cs:
  - CopyToBufferWithCast(): Copy from source to buffer with conversion
  - CopyFromBufferWithCast(): Copy from buffer to destination with conversion
  - Handles 1D and multi-dimensional strided arrays

- NpyIter.cs:
  - Initialize handles COMMON_DTYPE flag to auto-find common dtype
  - Stores source dtypes and validates casting rules
  - GetDataPtr returns buffer pointer when buffering enabled
  - CRITICAL BUG FIX: Dispose was using NativeMemory.Free for buffers
    allocated with AlignedAlloc. Now correctly uses FreeBuffers which
    calls AlignedFree. This was causing memory corruption and test crashes.

13 new NumPy parity tests:
- Cast_Int32ToFloat64_SafeCasting
- Cast_Float64ToInt32_UnsafeCasting
- Cast_Float64ToInt32_SafeCasting_Throws
- Cast_Int16ToInt32_SafeCasting
- Cast_CommonDtype_TwoOperands
- Cast_WriteOutput_WithConversion
- Cast_SameKindCasting_IntToInt
- Cast_SameKindCasting_IntToFloat_Throws
- Cast_NoCasting_SameType_Allowed
- Cast_NoCasting_DifferentType_Throws
- Cast_RequiresBuffered_ThrowsWithoutBuffer
- And more...

All 233 NpyIter tests pass, 5833 total tests pass.
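For reference, the NumPy semantics these rules mirror (illustrative Python sketch, not part of the port):

```python
import numpy as np

a = np.arange(3, dtype=np.int32)

# Safe cast int32 -> float64 during buffered iteration:
vals = [float(x) for x in np.nditer(a, flags=['buffered'],
                                    op_dtypes=[np.float64],
                                    casting='safe')]

# float64 -> int32 is not a 'same_kind' cast, so construction fails:
try:
    np.nditer(np.arange(3.0), flags=['buffered'],
              op_dtypes=[np.int32], casting='same_kind')
    rejected = False
except TypeError:
    rejected = True

print(vals, rejected)   # [0.0, 1.0, 2.0] True
```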
Adds basic reduction iteration support matching NumPy's nditer API:

- Reduction detection via op_axes with -1 entries for READWRITE operands
- REDUCE_OK flag validation: throws if reduction detected without flag
- IsFirstVisit(operand): checks if current element is first visit
  (for initialization, e.g., set to 0 before summing)
- IsReduction property: check if iterator has reduction operands
- IsOperandReduction(op): check if specific operand is reduction
- REDUCE flags set on iterator (NpyIterFlags.REDUCE) and operands
  (NpyIterOpFlags.REDUCE) when reduction is detected

Key implementation details:
- Modified ApplyOpAxes to detect reduction axes and validate REDUCE_OK
- Fixed isWriteable check to only match WRITE flag (READWRITE includes both)
- Modified ValidateIterShape to account for op_axes -1 entries
- Modified Initialize to set up strides directly from op_axes when provided
  instead of using np.broadcast_to (which fails for reduction shapes)

Tests added (7 new NumPy parity tests):
- Reduction_1DToScalar_IteratesCorrectly
- Reduction_2DToScalar_IteratesCorrectly
- Reduction_2DAlongAxis1_ProducesCorrectResult
- Reduction_IsFirstVisit_ReturnsTrueOnFirstElement
- Reduction_WithoutReduceOK_Throws
- Reduction_ReadOnlyOperand_DoesNotThrow
- Reduction_HasReduceFlag_WhenReductionDetected

All 240 NpyIter tests and 5840 total tests passing.
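This matches NumPy's canonical nditer reduction pattern, where a readwrite scalar broadcast against the input requires `reduce_ok` (sketch):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
b = np.array(0.)   # readwrite scalar broadcast against a => reduction

with np.nditer([a, b], flags=['reduce_ok'],
               op_flags=[['readonly'], ['readwrite']]) as it:
    for x, y in it:
        y[...] += x

print(float(b))   # 276.0 == a.sum()
```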
- Add READWRITE validation: reduction operands must have both READ and
  WRITE flags (WRITEONLY throws ArgumentException). NumPy requires this
  because a reduction must read the existing value before accumulating into it.

- Add buffer reduction fields to NpyIterState:
  - ReducePos: current position in reduce outer loop
  - ReduceOuterSize: size of reduce outer loop
  - ReduceOuterStrides: per-operand reduce outer strides
  - GetReduceOuterStride/SetReduceOuterStride accessors

- Update IsFirstVisit to check buffer reduce_pos:
  Part 1: Check coordinates (existing) - if stride=0 and coord!=0, not first
  Part 2: Check buffer reduce_pos (new) - when BUFFERED flag set, if
  ReducePos!=0 and operand's reduce outer stride is 0, not first visit

- Add Reduction_WriteOnlyOperand_Throws test

241 NpyIter tests passing, 5843 total tests passing (excluding OpenBugs)
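NumPy enforces the same rule at construction time; a sketch (the exact error message differs, but a write-only reduction operand is rejected):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
try:
    np.nditer([a, np.array(0.)], flags=['reduce_ok'],
              op_flags=[['readonly'], ['writeonly']])
    rejected = False
except ValueError:
    rejected = True

print(rejected)   # True
```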
…Py parity

Implements NumPy's double-loop pattern for efficient buffered reduction
(nditer_templ.c.src lines 131-210). This avoids re-buffering during
reduction by separating iteration into inner (core) and outer loops.

Key changes:

NpyIterState new fields:
- CoreSize: Number of inputs per output element (reduce dimension size)
- CorePos: Current position within core [0, CoreSize) for IsFirstVisit
- SetBufStride/GetBufStride: Accessors for buffer strides

SetupBufferedReduction:
- CoreSize = Shape[outerDim] (reduce axis size)
- ReduceOuterSize = transferSize / CoreSize (number of outputs)
- BufStrides: 0 for reduce ops (stay at same output), elemSize for others
- ReduceOuterStrides: elemSize for reduce ops (next output),
  elemSize*CoreSize for non-reduce ops

BufferedReduceAdvance:
- Inner loop: increment CorePos, advance by BufStrides
- Outer loop: reset CorePos to 0, advance by ReduceOuterStrides
- Returns 0 when buffer exhausted for refill

IsFirstVisit for buffered mode:
- Uses CorePos = 0 check instead of coordinates
- First visit is only at start of each output element's accumulation

CopyReduceBuffersToArrays:
- For reduce operands: copy ReduceOuterSize elements (number of outputs)
- For non-reduce operands: copy full CoreSize * ReduceOuterSize elements
- Uses ResetDataPtrs as destination (original array, not buffer)

GetDataPtr for buffered reduce:
- Returns DataPtrs directly (tracked by BufferedReduceAdvance)
- Instead of computing from IterIndex which doesn't work with double-loop

Tests added:
- BufferedReduction_1DToScalar_ProducesCorrectResult
- BufferedReduction_2DAlongAxis1_ProducesCorrectResult
- BufferedReduction_IsFirstVisit_WorksWithBuffering
- BufferedReduction_LargeArray_ExceedsBuffer (tests buffer refill)
- BufferedReduction_WithCasting_WorksCorrectly
- BufferedReduction_DoubleLoopFields_AreSetCorrectly

247 NpyIter tests passing, 5847 total tests passing (excluding OpenBugs)
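This is the counterpart of NumPy's buffered-reduction pattern, which requires `delay_bufalloc` so the output can be initialized before the buffers are filled (sketch following the NumPy nditer reduction docs):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
b = np.array(0.)

with np.nditer([a, b], flags=['reduce_ok', 'buffered', 'delay_bufalloc'],
               op_flags=[['readonly'], ['readwrite']]) as it:
    it.operands[1][...] = 0   # initialize output before buffering starts
    it.reset()
    for x, y in it:
        y[...] += x

print(float(b))   # 276.0
```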
…coreSize)

Problem:
When buffer size is smaller than core size (reduce dimension size), the
buffered reduction double-loop pattern broke down:
- BufIterEnd was set to CoreSize instead of bufferSize
- coreOffset tracking was misaligned with actual core boundaries
- Reduce operand reload decisions were incorrect

Root Cause:
The coreOffset tracking was based on buffer refill counts, but core boundaries
are determined by iteration coordinates. When bufferSize < coreSize, multiple
buffer refills occur per core, causing the tracking to desync.

Fix:
1. Use pointer comparison to detect new output positions:
   - After GotoIterIndex, compare current array position with previous writeback
   - Only reload reduce operand if pointer changed (new output element)

2. Add ArrayWritebackPtrs field to store writeback positions separately:
   - ResetDataPtrs must stay as base pointers for GotoIterIndex
   - ArrayWritebackPtrs stores the actual writeback destinations

3. Set BufIterEnd to min(BufferSize, CoreSize) for small buffer support

Test Results:
- All 252 NpyIter tests pass
- Small buffer test (3,8)->(3,) with bufferSize=4 now produces [28, 92, 156]
- NumPy parity confirmed for buffered reduction edge cases

Analysis documented:
- EXLOOP, BUFNEVER, BUF_REUSABLE are performance optimizations not bugs
- Current implementation is functionally correct with NumPy
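The equivalent NumPy scenario, with a buffer deliberately smaller than the reduce core (illustrative sketch; `buffersize` is the Python-level knob):

```python
import numpy as np

a = np.arange(24).reshape(3, 8)
b = np.zeros((3, 1))   # reduce along axis 1

# buffersize=4 is smaller than the reduce core (8 inputs per output row):
with np.nditer([a, b], flags=['reduce_ok', 'buffered', 'delay_bufalloc'],
               op_flags=[['readonly'], ['readwrite']],
               buffersize=4) as it:
    it.operands[1][...] = 0
    it.reset()
    for x, y in it:
        y[...] += x

print(b.ravel().tolist())   # [28.0, 92.0, 156.0]
```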
Audit Summary (2026-04-16):
- 252 tests passing (101 parity, 70 battle, 41 ref tests)
- 32 NumPy APIs fully implemented
- All core features complete: iteration, indexing, buffering, casting, reduction

Key findings:
- Implementation is PRODUCTION READY
- No critical missing features for NumSharp operations
- Full NumPy parity for buffered reduction including small buffer handling
- Intentional divergences documented (unlimited dims, 8 max operands)

Remaining low-priority items (performance only):
- BUFNEVER per-operand buffer skip
- Enhanced buffer reuse logic
- EXLOOP increment optimization
BREAKING CHANGE: Removed MaxOperands=8 limit

NumPy uses NPY_MAXARGS=64 (was 32 in 1.x) as a runtime constant. NumSharp
now achieves full parity by supporting truly unlimited operands through
dynamic allocation of all per-operand arrays.

Changes:
- NpyIterState: Convert all 14 fixed per-operand arrays to dynamically
  allocated pointers (DataPtrs, ResetDataPtrs, BaseOffsets, OpItFlags,
  OpDTypes, OpSrcDTypes, ElementSizes, SrcElementSizes, InnerStrides,
  Buffers, BufStrides, ReduceOuterStrides, ReduceOuterPtrs, ArrayWritebackPtrs)
- AllocateDimArrays: Now allocates both dimension arrays AND operand arrays
  in separate contiguous blocks with proper alignment
- FreeDimArrays: Now frees both blocks
- All accessor methods simplified (no more fixed() statements needed)
- Copy method: Fixed to properly copy operand arrays for all cases
  including scalar (NDim=0)
- NpyIterCoalescing: Updated to use direct pointer access

Tests:
- Added UnlimitedOperands_100Operands_IteratesCorrectly test
- Updated TooManyOperands_Throws to ManyOperands_Works
- Updated UnlimitedDimensions_MaxOperands to verify 16 operands work
- 253 NpyIter tests passing

Memory layout for operand arrays (each array sized to NOp elements):
- 9 long* arrays (8 bytes per entry; 72 bytes per operand across the nine arrays)
- 2 int* arrays
- 1 ushort* array
- 2 byte* arrays
All sections 8-byte aligned for optimal cache performance.
Deep audit validates NumSharp NpyIter against NumPy 2.x using:
1. Behavioral Comparison - 55 NumPy vs NumSharp tests
2. Edge Case Matrix - 12 systematic edge cases
3. Source Code Comparison - NumPy C vs C# structural analysis
4. Property Invariants - 13 mathematical invariant tests

Total validation: 333 tests (253 unit + 80 behavioral/invariant)

Key findings:
- All tests pass, confirming full NumPy parity
- NEGPERM behavior verified (reversed arrays iterate in memory order)
- Buffered reduce double-loop matches NumPy structure
- Coalescing algorithm structurally identical
- All 6 property invariants hold

New files:
- docs/NPYITER_DEEP_AUDIT.md - Comprehensive audit report
- test/audit_behavioral.cs - Runtime audit script
- test/NumSharp.UnitTest/.../NpyIterNumPyBattleTests.cs - Battle tests
Three battle tests document NumSharp's iteration order differences:

1. F-order iteration: NumSharp is C-order only (documented limitation)
   - NumPy: [0, 3, 1, 4, 2, 5] (F-order)
   - NumSharp: [0, 1, 2, 3, 4, 5] (C-order)

2. Multi-operand broadcasting: Different iteration order
   - NumPy: [(0,0), (1,1), (2,2), (0,3), (1,4), (2,5)]
   - NumSharp: [(0,0), (0,3), (1,1), (1,4), (2,2), (2,5)]

3. Non-contiguous view: Memory order vs logical C-order
   - NumPy: [1, 3, 11, 13] (logical C-order)
   - NumSharp: [1, 11, 3, 13] (memory order)

All tests now pass (277 total NpyIter tests).
NumPy's nditer coalescing strategy:
- K-order: Always coalesce for memory efficiency (sort by stride)
- C-order on C-contiguous: Coalesce → memory order (== C-order)
- F-order on F-contiguous: Coalesce → memory order (== F-order)
- F-order on C-contiguous: NO coalescing, reverse axes for F-order

Previously NumSharp coalesced for ALL orders whenever the array was
contiguous in any layout, which produced an incorrect iteration order
for F-order on C-contiguous arrays.

Changes:
- NpyIter.cs: Add CheckAllOperandsContiguous(bool cOrder) helper
  to check if arrays are contiguous in requested order
- NpyIter.cs: Only coalesce when order matches array contiguity
- NpyIterCoalescing.cs: Add IsContiguousForCoalescing() check

Test results:
- 277 NpyIter tests passing (including 24 new battle tests)
- 5813 total tests passing
- F-order now produces [0,3,1,4,2,5] instead of [0,1,2,3,4,5]
  for a 2x3 C-contiguous array (matches NumPy)
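The target NumPy behavior, for reference (illustrative Python sketch, not part of the port):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous

print([int(v) for v in np.nditer(a, order='F')])   # [0, 3, 1, 4, 2, 5]
print([int(v) for v in np.nditer(a, order='C')])   # [0, 1, 2, 3, 4, 5]
```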
…arrays

Problem:
- K-order iteration on broadcast arrays produced wrong order
  (stride-based sorting with stride=0 breaks axis ordering)
- K-order iteration on non-contiguous views also used wrong order
- NumPy: (3,) x (2,3) broadcast iterates C-order: [(0,0),(1,1),(2,2),(0,3),(1,4),(2,5)]
- NumSharp was producing: [(0,0),(0,3),(1,1),(1,4),(2,2),(2,5)]

Root cause:
- For K-order, we sorted axes by stride magnitude
- But GetMinStride excludes stride=0, leading to incorrect axis ordering
- Non-contiguous views similarly got wrong ordering from stride sort

Solution:
- For K-order with broadcast dimensions (stride=0), fall back to C-order
- For K-order with non-contiguous arrays, fall back to C-order
- Added HasBroadcastStrides() helper to detect broadcast dimensions
- CheckAllOperandsContiguous now uses absolute strides to handle
  reversed arrays (negative strides become positive after FlipNegativeStrides)
- Separate coalescing logic for C/F/K orders to preserve iteration semantics

Changes:
- NpyIter.cs: Added broadcast detection, fixed coalescing decision logic
- NpyIterNumPyBattleTests.cs: Updated tests to expect correct NumPy behavior
  (removed [Misaligned] attributes from Battle_MultiOperandBroadcasting and
  Battle_NonContiguousViewOrder since they now match NumPy)

All 277 NpyIter tests passing.
All 5877 project tests passing.
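The expected NumPy iteration order for the broadcast case, for reference (illustrative sketch):

```python
import numpy as np

x = np.arange(3)                  # (3,) broadcast against (2, 3)
y = np.arange(6).reshape(2, 3)

pairs = [(int(a), int(b)) for a, b in np.nditer([x, y])]
print(pairs)   # [(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5)]
```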
Deep audit against NumPy 2.4.2 source revealed 7 behavioral bugs. All fixed via TDD.

Bug #1: Negative strides always flipped regardless of order
- NumPy (nditer_constr.c:297-307) only flips when NPY_ITFLAG_FORCEDORDER not set
- FORCEDORDER is set by C, F, and A orders. Only K-order skips it.
- Fix: Only call FlipNegativeStrides for K-order
- CheckAllOperandsContiguous now takes allowFlip param (abs strides only when flipping)
- Affects: 1D/2D reversed arrays with C/F/A orders

Bug #2: NO_BROADCAST flag not enforced
- Code was skipping NO_BROADCAST operands instead of enforcing the constraint
- Fix: NO_BROADCAST operands must match iterShape without dim-1 stretching
- ValidateIterShape now always runs (not just when iterShape is provided)

Bug #3: F_INDEX returned C-order indices
- Coalescing reduces to NDim=1, losing original axis structure needed for F-index
- Fix: Disable coalescing when C_INDEX or F_INDEX is set (like MULTI_INDEX)

Bug #4: ALLOCATE with null operand threw NullReferenceException
- CalculateBroadcastShape accessed null op[i].ndim
- Fix: Skip null operands in broadcast shape calc, then allocate them after
  with correct shape (from op_axes if provided) and dtype

Bug #5,6,7: op_axes reductions broken (axis=0 gave [15,0,0], axis=1 threw)
- ApplyOpAxes was re-applying op_axes to strides that were already correctly set
  in the main operand setup loop, zeroing out non-reduce strides
- CalculateBroadcastShape didn't know about op_axes, couldn't compute iter shape
- Fix: ApplyOpAxes now only validates and sets REDUCE flags, not strides
- Fix: CalculateBroadcastShape now accepts opAxesNDim/opAxes parameters
- Uses production Shape.ResolveReturnShape API for all broadcasting

Refactoring: Uses production Shape.ResolveReturnShape / np.broadcast_to
- Replaces custom broadcast shape calculation
- User feedback: production APIs are 1-to-1 with NumPy

Testing:
- 21 new TDD tests in NpyIterParityFixTests.cs
- All 298 NpyIter tests pass
- All 5898 project tests pass
- Final battletest: 21/21 scenarios match NumPy 2.4.2 exactly

Fixed test: NullOperand_Throws now expects ArgumentException (more accurate than
NullReferenceException since null operand without ALLOCATE is an argument error).
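Bug #2's expected behavior, at the NumPy level (illustrative sketch of the `no_broadcast` op flag being enforced):

```python
import numpy as np

x = np.arange(3)
y = np.arange(6).reshape(2, 3)

try:
    np.nditer([x, y], op_flags=[['readonly', 'no_broadcast'], ['readonly']])
    rejected = False
except ValueError:
    rejected = True

print(rejected)   # True: x would have to be broadcast to (2, 3)
```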
Adds F-contiguity detection and OrderResolver for NumPy's 4 memory orders
at minimum functionality, with zero behavioral change to existing code.

Changes:
- Shape.cs: F-contig detection via ComputeIsFContiguousStatic (mirror of C-contig
  algorithm, scan left-to-right). Sets ArrayFlags.F_CONTIGUOUS flag during flag
  computation. New IsFContiguous property (O(1) flag check). New
  ComputeFContiguousStrides helper. New Shape(long[] dims, char order)
  constructor for explicit physical-order construction.
- Scalar constructor now sets both C_CONTIGUOUS and F_CONTIGUOUS (matches NumPy).
- OrderResolver.cs (NEW): Resolves NumPy order chars (C/F/A/K) to physical
  storage orders (C or F). 'A' and 'K' require a source Shape for resolution
  (matches NumPy: creation functions without source throw "only 'C' or 'F'
  order is permitted").
- np.empty.cs: New overload np.empty(shape, order, dtype) wiring
  OrderResolver through to Shape.

Key insight: transpose already produces F-contig memory layout; previously this
went undetected because F_CONTIGUOUS flag was never set. Now:
  arr = np.arange(24).reshape(2,3,4)
  arr.T.Shape.IsFContiguous  // true (previously: false / undetected)

Design:
- Only C and F are physical storage layouts; A and K are logical decisions
  that resolve to C or F based on source array layout.
- OrderResolver centralizes the C/F/A/K -> C/F mapping, letting future
  wiring of np.copy/np.array/flatten/ravel/reshape be a 1-line call.
- Existing IsContiguous callers (116 usages across 50 files) unchanged -
  they still see C_CONTIGUOUS=false for F-contig arrays and take the
  strided path (which is correct, just not yet SIMD-accelerated).

Tests (24 new in Shape.Order.Tests.cs):
- Scalar and 1-D arrays are both C and F contig
- Multi-dim C-contig is not F-contig and vice versa
- Transpose of C-contig now reports IsFContiguous=true
- Shape(dims, 'F') produces correct F-order strides (1, 2, 6 for 2x3x4)
- Shape(dims, 'A'/'X') throws ArgumentException
- OrderResolver: C/F resolve directly; A/K without source throw; A/K with
  source resolve based on source layout
- np.empty(order='C'/'F') produces correct layout
- np.empty(order='A'/'K') throws (matches NumPy)

Verification:
- 6017 tests pass on both net8.0 and net10.0 (zero regressions)
- NumPy parity verified via Python side-by-side comparison
- All order resolution semantics match NumPy 2.4.2

Future phases unblocked (each a ~1-line change):
- ILKernelGenerator fast paths can add || IsFContiguous for element-wise ops
- NpyIter.CheckAllOperandsContiguous can use Shape.IsFContiguous directly
- np.copy(order), np.array(order), flatten(order), ravel(order) wiring
- np.asfortranarray, np.ascontiguousarray
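The NumPy semantics being mirrored here, for reference (illustrative sketch of the contiguity flags and the C/F-only creation rule):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])       # True False
print(a.T.flags['C_CONTIGUOUS'], a.T.flags['F_CONTIGUOUS'])   # False True

# Creation functions accept only physical orders:
try:
    np.empty((2, 3), order='A')
    rejected = False
except ValueError:
    rejected = True
print(rejected)   # True
```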
Review of initial F-order support surfaced three design issues where
NumSharp diverged from NumPy's patterns. This refactor aligns with
NumPy's flagsobject.c:_UpdateContiguousFlags exactly.

Changes:

1. Unified contiguity computation (single-pass)
   - Replaced two separate functions (ComputeIsContiguousStatic,
     ComputeIsFContiguousStatic) with one combined
     ComputeContiguousFlagsStatic returning (isC, isF) tuple.
   - Mirrors NumPy's _UpdateContiguousFlags which computes both in one
     function with a shared dim==0 early exit.
   - Fewer call sites, one traversal per contiguity check, cleaner
     shared logic.

2. Fixed Shape.Order property (was hardcoded to layout = 'C')
   - Now derives from actual contiguity flags: returns 'F' if strictly
     F-contiguous (IsFContiguous && !IsContiguous), else 'C'.
   - Transposed C-contig arrays now correctly report Order='F'.
   - 1-D and scalar shapes (both C and F contig) report 'C' by
     convention (NumPy-default reference order).
   - Non-contiguous shapes report 'C' as default reference.

3. Fixed empty array flag computation (any dim == 0)
   - NumPy short-circuits _UpdateContiguousFlags on dim==0 setting
     BOTH C_CONTIGUOUS and F_CONTIGUOUS unconditionally and NOT setting
     BROADCASTED. Empty arrays have no elements so broadcast semantics
     are meaningless.
   - Previously NumSharp computed strides like (0, 3, 1) for shape
     (2, 0, 3), triggered IsBroadcasted=true, and then skipped
     contiguity flag assignment entirely. Result was an empty array
     reporting IsContiguous=false, IsFContiguous=false.
   - Now matches NumPy: any dim=0 short-circuits to set both C and F
     contig + WRITEABLE + ALIGNED, clear BROADCASTED.

4. Clarified `layout` const documentation
   - The internal const char layout = 'C' was misleadingly named (as if
     it described the shape's physical order) but only ever used as a
     hash seed in ComputeSizeAndHash. Updated doc comment to clarify
     this is NOT the physical memory order — use Order / IsContiguous
     / IsFContiguous for actual layout info.
   - Value unchanged to preserve existing hash stability.

Additional tests (6 new):
- Order property for C, F, transpose, 1-D, scalar cases
- Empty array is both C and F contiguous (matching NumPy 2.4.2)

Test results:
- 6023 tests pass on both net8.0 and net10.0 (was 6017; 6 new tests)
- Zero regressions

NumPy source reference: numpy/_core/src/multiarray/flagsobject.c
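The empty-array and scalar flag semantics described above, checked against NumPy (illustrative sketch):

```python
import numpy as np

e = np.empty((2, 0, 3))   # any dim == 0 => both contiguity flags set
print(e.flags['C_CONTIGUOUS'], e.flags['F_CONTIGUOUS'])   # True True

s = np.array(3.14)        # 0-d scalar: also both
print(s.flags['C_CONTIGUOUS'], s.flags['F_CONTIGUOUS'])   # True True
```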
…enarios

Ports the last NumPy nditer surface gaps identified by the audit, each with
1-to-1 semantic parity verified against NumPy 2.4.2 via Python harness.

10 items implemented (all battletested):
  1. NpyIter_ResetBasePointers  (nditer_api.c:314)
     - Populate BaseOffsets during FlipNegativeStrides so ResetBasePointers
       can recompute ResetDataPtrs[iop] = baseptrs[iop] + baseoffsets[iop].
     - Public: NpyIterRef.ResetBasePointers(ReadOnlySpan<IntPtr>) and
       ResetBasePointers(NDArray[]) convenience overload.

  2. NPY_ITFLAG_TRANSFERFLAGS_SHIFT packing  (nditer_constr.c:3542)
     - Pack NpyArrayMethodFlags into top 8 bits of ItFlags (shift=24).
     - Public: NpyIterRef.GetTransferFlags() + NpyArrayMethodFlags enum
       + NpyIterConstants.TRANSFERFLAGS_SHIFT/MASK constants.
     - REQUIRES_PYAPI never set in .NET (no Python GIL). SUPPORTS_UNALIGNED
       and NO_FLOATINGPOINT_ERRORS always set (raw pointer loops, .NET casts
       don't raise FPE). IS_REORDERABLE set for numeric casts.

  3. NpyIter_GetGetMultiIndex factory  (nditer_templ.c.src:481)
     - Specialized delegate factory returning NpyIterGetMultiIndexFunc with
       3 dispatches: IDENTPERM (direct copy), positive perm (apply perm[]),
       NEGPERM (apply perm+flip). BUFFER and HASINDEX don't affect coords
       so no specialization needed for them.
     - Public: GetMultiIndexFunc(), GetMultiIndexFunc(out errmsg),
       InvokeMultiIndex(fn, coords) — ref-struct-safe invocation.
     - Also fixes: IDENTPERM flag is now set at construction (after
       AllocateDimArrays). Previously only set post-coalescing, leaving
       MULTI_INDEX iterators without the fast-path flag.

  4. NpyIter_GetInnerFixedStrideArray  (nditer_api.c:1357)
     - Public: GetInnerFixedStrideArray(Span<long>).
     - Buffered: copies BufStrides. Non-buffered: innermost-axis stride per
       operand. Returns BYTE strides (NumPy convention), multiplying
       NumSharp's element-count strides by ElementSizes[op].

  5. NpyIter_GetAxisStrideArray  (nditer_api.c:1309)
     - Public: GetAxisStrideArray(int axis, Span<long>).
     - With HASMULTIINDEX: walks perm to find internal axis (handles both
       positive and NEGPERM-encoded entries). Without: Fortran-order
       (fastest-first) lookup via NDim-1-axis. Byte strides.

  6. NpyIter_CreateCompatibleStrides  (nditer_api.c:1058)
     - Public: CreateCompatibleStrides(long itemsize, Span<long>).
     - Requires HASMULTIINDEX, rejects flipped axes. Walks perm from
       innermost (NDim-1) outward, accumulating itemsize into outStrides[axis]
       in original (C-order) axis slots.

  7. NpyIter_DebugPrint  (nditer_api.c:1402)
     - Public: DebugPrint(), DebugPrint(TextWriter), DebugPrintToString().
     - Faithful port of NumPy's dump format: ItFlags decoded, NDim/NOp,
       IterSize/Start/End/Index, Perm, DTypes, DataPtrs, BaseOffsets,
       OpItFlags, BufferData (when BUFFER), per-axis data.

  8. NPY_ITER_REDUCTION_AXIS encoding  (common.h:347, nditer_constr.c:1431)
     - Additive encoding: axis + (1 << 30). Values >= (1<<30)-1 flagged as
       reduction axes. Value 0x40000000 for axis 0, 0x3FFFFFFF for axis -1.
     - Public: NpyIterUtils.ReductionAxis(int) encoder and GetOpAxis(int,
       out bool) decoder. NpyIterConstants.REDUCTION_AXIS_OFFSET = 1<<30.
     - Integrated into CalculateBroadcastShape (rejects length != 1 on
       reduction axes), ValidateIterShape, and ApplyOpAxes (enforces
       REDUCE_OK + sets REDUCE flag).

  9. WRITEMASKED + ARRAYMASK + check_mask_for_writemasked_reduction
     - TranslateOpFlags now maps NpyIterPerOpFlags.WRITEMASKED ->
       NpyIterOpFlags.WRITEMASKED on op flags.
     - PreCheckMaskOpPairing validates: WRITEMASKED requires one ARRAYMASK,
       ARRAYMASK requires >=1 WRITEMASKED, at most one ARRAYMASK, no
       operand with both flags.
     - SetMaskOpFromFlags sets NpyIterState.MaskOp index of ARRAYMASK operand.
     - CheckMaskForWriteMaskedReduction enforces (nditer_constr.c:1328):
       for any WRITEMASKED + REDUCE operand, no axis may have maskstride!=0
       && opstride==0 (would produce multiple mask values per reduction element).
     - Public: NpyIterRef.MaskOp, HasWriteMaskedOperand.

 10. NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE per-op flag
     - Added NpyIterPerOpFlags.OVERLAP_ASSUME_ELEMENTWISE_PER_OP = 0x40000000
       in the correct per-operand flag slot (NumPy's location). Accepted
       syntactically as a marker for COPY_IF_OVERLAP fast-path elision.

Correctness bugs fixed while battletesting:

  A. SetupBufferedReduction produced inverted strides for non-reduce operands.
     BufStride was set to elemSize (assumed linear buffer); correct value is
     the operand's stride along the REDUCE axis (inner loop = reduce axis
     traversal). ReduceOuterStride was set to elemSize*coreSize; correct is
     stride along the non-reduce axis.

  B. SetupBufferedReduction only worked for 2-axis cases (one reduce, one
     non-reduce). For 3D+ with multiple non-reduce axes, added CoreSize=0
     short-circuit that defers to regular N-D Advance() — which correctly
     carries multiple axes via Coords + per-axis strides. stride=0 on
     reduce axis naturally keeps y's pointer fixed during reduce iteration.

  C. GetDataPtr for BUFFER+REDUCE with CoreSize=0 returned a buffer pointer
     indexed by IterIndex (linear assumption). For reduce this is wrong —
     DataPtrs already track the correct position. Now returns DataPtrs
     whenever REDUCE flag is set.

  D. Reset() didn't reposition to IterStart. IterIndex was set to IterStart
     but DataPtrs/Coords were reset to array origin, desyncing the iterator
     state for ranged iterators with IterStart > 0. Now delegates to
     GotoIterIndex(IterStart) which sets all three consistently.

  E. K-order fallback to C-order was too aggressive — triggered for all
     non-contiguous arrays, defeating NumPy's K-order semantic of iterating
     in memory order. Fixed to fall back only when broadcast axes (stride=0)
     are present; merely non-contiguous (transposed, strided, negative-
     stride) now properly sorts axes by |stride| descending.

  F. CoalesceAxes rejected size-1 axes unless stride==0. Size-1 axes
     contribute no iteration and should always be absorbed into a neighbor.
     Fix restores proper 1D coalescing for shapes like (2,4,1) contiguous.

  G. FlipNegativeStrides now populates BaseOffsets[op] (previously an
     allocated-but-unused field). Prereq for item #1 (ResetBasePointers).

Battletest harness:
  - Python<->NumSharp scenario harness in a temp workspace with 3 structured
    waves (25 scenarios each) plus a 491-scenario random fuzz test with
    deterministic seed (42). All scenarios compare element sequences, stride
    arrays, multi-indices, reduce outputs, and iteration state byte-for-byte
    against NumPy 2.4.2 output.
  - Coverage: 1D-5D shapes; int8/16/32/64, uint16, float32/64 dtypes;
    contiguous, transposed (2D+3D), strided, negative-stride, size-1 axes,
    and all combinations; MULTI_INDEX, C_INDEX, F_INDEX; RANGED + goto;
    explicit/implicit reduction axes; multi-operand broadcast.
  - Result: 566/566 scenarios pass (25+25+25+491). All semantically
    equivalent to NumPy's C-level nditer output.

Added tests (94 new unit tests):
  - NpyIterAxisStrideArrayTests (12)
  - NpyIterCreateCompatibleStridesTests (9)
  - NpyIterDebugPrintTests (12)
  - NpyIterGetMultiIndexFuncTests (10)
  - NpyIterInnerFixedStrideArrayTests (9)
  - NpyIterOverlapAssumeElementwiseTests (5)
  - NpyIterReductionAxisEncodingTests (11)
  - NpyIterResetBasePointersTests (10)
  - NpyIterTransferFlagsTests (8)
  - NpyIterWriteMaskedTests (8)

Regression: 6023/6023 project tests pass (was 5898 before this work),
zero regressions. The suite gains ~125 tests over baseline because
fixes C-F unblocked test cases that had previously been failing silently.
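Fix E's target semantics, for reference: NumPy's K-order on a transposed view iterates in memory order rather than falling back to C-order (illustrative sketch):

```python
import numpy as np

t = np.arange(6).reshape(2, 3).T   # transposed (F-contiguous) view

print([int(v) for v in np.nditer(t, order='K')])   # [0, 1, 2, 3, 4, 5] memory order
print([int(v) for v in np.nditer(t, order='C')])   # [0, 3, 1, 4, 2, 5] logical order
```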
…rface

New file: OrderSupport.OpenBugs.Tests.cs (39 tests, 11 marked [OpenBugs])

Comprehensive TDD test file documenting the gap between NumSharp's
current behavior and NumPy 2.x's expected behavior for memory order
support. Each test uses NumPy's exact output as the expected value
(verified via side-by-side Python scripts).

Test sections:
1. Creation APIs (np.zeros/ones/empty/full) — 10 tests
2. Copy/conversion (np.copy, NDArray.copy) — 7 tests
3. Manipulation (flatten, ravel) — 5 tests
4. Arithmetic output layout — 3 tests
5. Reductions on F-contig (math-equivalent) — 6 tests
6. Slicing contiguity preservation — 2 tests
7. Broadcasting output layout — 1 test
8. Transpose behavior — 3 tests
9. Iteration order (C-order via GetOffset) — 1 test
10. Order property derivation — 3 tests

Results (net8.0 and net10.0):
- 28 tests pass (documents working behavior / NumPy parity)
- 11 tests fail (marked [OpenBugs], excluded from CI via filter)

Currently failing [OpenBugs] — API gaps to close in future phases:

Section 2 — np.copy / NDArray.copy ignore order parameter:
  - NpCopy_FOrder_ProducesFContig
  - NpCopy_AOrder_FSource_ProducesFContig
  - NpCopy_KOrder_FSource_ProducesFContig
  - NDArrayCopy_FOrder_ProducesFContig
  - NDArrayCopy_AOrder_FSource_ProducesFContig

Section 3 — flatten/ravel ignore/lack order:
  - Flatten_CContig_FOrder_MatchesNumPy
  - Flatten_FContig_FOrder_MatchesNumPy
  - Ravel_FOrder_ApiGap (ravel has no order parameter at all)

Section 4 — arithmetic always produces C-contig output:
  - Arithmetic_FContig_ScalarMul_PreservesFContig
  - Arithmetic_FPlusF_PreservesFContig

Section 7 — broadcast always produces C-contig output:
  - Broadcast_FContig_PlusFCol_PreservesFContig

Currently passing (NumPy-aligned behavior confirmed):
- np.zeros/ones/full preserve F-contig when given an F-Shape (workaround
  for missing order= parameter, but layout IS preserved)
- np.empty(order='C'/'F'/'A'/'K') — correct behavior; A/K throw
- All reductions (sum, mean, min, max, axis=0, axis=1) work on F-contig
- Transpose toggles C<->F contig correctly
- Slicing: 1-col slice of F-contig is both C and F contig (matches NumPy)
- Slicing: row-slice of F-contig is neither (matches NumPy)
- Shape.Order property reports correct char based on flags
- Scalar multiply on F-contig produces correct values (just loses layout)
- Indexed iteration on F-contig produces C-order logical traversal
  (matches NumPy's arr.flat semantics)

CI verification:
- Full suite with CI filter: 6051 pass, 0 fail (net8.0 and net10.0)
- With TestCategory=OpenBugs: 11 fail (as expected)
- With TestCategory!=OpenBugs: 28 pass (0 regressions)

Next steps: fix each [OpenBugs] by wiring order through the respective
API using OrderResolver. Remove the attribute after verifying the test
passes with NumPy's expected output.

Expands OrderSupport.OpenBugs.Tests.cs from 39 to 67 tests covering
every NumPy function that accepts an 'order' parameter.

NumPy functions covered (15 total that accept order=):
- Creation: empty, empty_like, zeros, zeros_like, ones, ones_like,
  full, full_like, eye
- Conversion: array, asarray, asanyarray, copy
- Manipulation: reshape, ravel
- Method: astype, flatten, copy

New sections added:
- Section 11: np.empty_like (default K, preserves source layout)
- Section 12: np.zeros_like (default K) + values-are-zero test
- Section 13: np.ones_like (default K) + values-are-one test
- Section 14: np.full_like (default K) + values-are-fill test
- Section 15: np.eye (C/F order) + identity values test
- Section 16: np.asarray / np.asanyarray API gaps
- Section 17: astype (default K, preserves source layout)
- Section 18: np.reshape with F-order (column-major fill)
- Section 19: np.ravel with C/F/A/K orders
- Section 20: np.array with order (Array input overload)
- Section 21: np.asfortranarray / np.ascontiguousarray (missing APIs)

Results (net8.0 and net10.0):
- 42 tests pass (document working behavior / NumPy parity)
- 25 tests fail (marked [OpenBugs], excluded from CI via filter)

25 [OpenBugs] documenting gaps:
- *_like don't preserve F-contig (5 tests: empty/zeros/ones/full/astype)
- np.copy/NDArray.copy order ignored (7 tests from prior commit)
- flatten/ravel order ignored (3 tests)
- arithmetic/broadcast don't preserve F-contig (3 tests)
- np.eye has no order param (1 test)
- np.reshape has no order param (1 test)
- np.array order ignored (1 test)
- np.asarray/asanyarray have no NDArray+order overload (2 tests)
- np.asfortranarray/np.ascontiguousarray don't exist (2 tests)

Confirmed NumPy parity (new passing tests):
- np.empty_like/zeros_like/ones_like/full_like on C-contig (K default -> C)
- np.zeros_like/ones_like/full_like produce correct fill values
- np.eye default produces C-contig identity matrix
- astype preserves C-contig from C source (K default)
- astype preserves values during type conversion
- np.reshape default produces row-major fill
- np.ravel default is C-order
- np.array default produces C-contig
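To make the expected order='F' semantics concrete, here is a minimal pure-Python model of column-major ravel/reshape (illustrative; not NumSharp code — `ravel_f` and `reshape_f` are our names):

```python
def ravel_f(rows):
    # Column-major read of a 2-D nested list: np.ravel(a, order='F') equivalent.
    nr, nc = len(rows), len(rows[0])
    return [rows[r][c] for c in range(nc) for r in range(nr)]

def reshape_f(flat, nr, nc):
    # Column-major fill: np.reshape(flat, (nr, nc), order='F') equivalent.
    out = [[None] * nc for _ in range(nr)]
    for i, v in enumerate(flat):
        out[i % nr][i // nr] = v
    return out

a = [[1, 2, 3], [4, 5, 6]]
assert ravel_f(a) == [1, 4, 2, 5, 3, 6]           # column-by-column read
assert reshape_f([1, 2, 3, 4, 5, 6], 2, 3) == [[1, 3, 5], [2, 4, 6]]
```

These are exactly the fill/read orders the Section 18/19 gap tests compare against NumPy.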

CI verification:
- TestCategory!=OpenBugs: 6065 pass, 0 fail (net8.0 and net10.0)
- TestCategory=OpenBugs: 25 fail (as expected bug reproductions)

All NumPy order baselines verified via NumPy 2.4.2 side-by-side Python scripts.
Nucs added 28 commits April 22, 2026 23:35
…nBugs])

All passing — NumSharp's broadcast primitives correctly produce NumPy-aligned
stride patterns and layout flags when sourced from F-contig inputs.

Findings (parity confirmed against NumPy 2.x)
- broadcast_to(F(4,3), (2,4,3)) → strides (0, 8, 32), neither C- nor F-contig.
- broadcast_to(C(3,4), (2,3,4)) → strides (0, 32, 8), neither flag.
  The stride=0 leading dim always knocks BOTH contiguity flags off.
- broadcast_to values verified for (2,4,3) case — replication along the new
  outer axis preserves inner data indexing for any input layout.
- broadcast_arrays(F, scalar) → first output preserves F-contig (shape
  already matches target, no stride=0); second is all-stride-0 (neither flag).
- broadcast_arrays(F(2,3), F(2,1)) → first F preserved; second has stride=0
  on the broadcast axis (neither flag).

These confirm that F-contig source arrays are handled correctly through
the broadcasting pipeline, at least for shape/layout — a reasonable
expectation, given that broadcasting is a Shape/strides-only operation
(no value copy).
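The stride arithmetic behind these findings is small enough to model directly — a hedged sketch of np.broadcast_to's stride rule (new and size-1 dims get stride 0, which then fails both canonical stride patterns; `broadcast_strides` is our name):

```python
def broadcast_strides(shape, strides, target):
    # np.broadcast_to semantics: align trailing dims; new/size-1 dims get stride 0.
    pad = len(target) - len(shape)
    out = []
    for i in range(len(target)):
        if i < pad or shape[i - pad] == 1:
            out.append(0)       # replicated axis: revisit the same bytes
        else:
            out.append(strides[i - pad])
    return tuple(out)

# F(4,3) float64: strides (8, 32); broadcast to (2,4,3) -> leading stride 0.
assert broadcast_strides((4, 3), (8, 32), (2, 4, 3)) == (0, 8, 32)
# C(3,4): strides (32, 8); broadcast to (2,3,4).
assert broadcast_strides((3, 4), (32, 8), (2, 3, 4)) == (0, 32, 8)
# Shape already matches target: strides (and layout flags) preserved.
assert broadcast_strides((2, 3), (8, 16), (2, 3)) == (8, 16)
# Broadcast axis inside an existing operand: stride 0 on that axis only.
assert broadcast_strides((2, 1), (8, 8), (2, 3)) == (8, 0)
```

A stride of 0 can never equal the canonical C or F stride for a dim of size > 1, which is why the stride=0 leading dim knocks both contiguity flags off.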
…m SGD

Extends the previous fusion-demo into a fully operational trained
classifier using the NeuralNetwork.NumSharp BaseLayer/BaseCost/
BaseOptimizer abstractions. Forward AND backward passes each collapse
the post-matmul element-wise chunk into a single NpyIter kernel.
Training end-to-end: per-epoch loss and accuracy, final test-set
evaluation, IL-kernel cache and delegate-slot reporting.

Architecture
------------
  784 (input) -> 128 (ReLU) -> 10 (linear logits), float32 throughout.

  Forward pass (per layer, fused bias+activation in ONE NpyIter):
    ReLU layer: y = max(xW + b, 0)                 -- NpyExpr.Max(Input(0)+Input(1), 0)
    Linear layer: y = xW + b                        -- NpyExpr.Input(0) + NpyExpr.Input(1)

  Backward pass (per layer, fused ReLU gradient mask in ONE NpyIter):
    ReLU backward: gradPreact = gradOut * (y > 0)   -- NpyExpr.Input(0) * NpyExpr.Greater(Input(1), 0)
    Linear backward: gradPreact = gradOut           -- pass-through

  Loss: SoftmaxCrossEntropy (combined, numerically stable). Forward
    computes max-subtracted softmax + categorical cross-entropy;
    Backward returns (softmax - labels)/batch via a cached softmax.

  Optimizer: Adam (the existing class, with the ms/vs init bug fixed).
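The simplified backward form can be sanity-checked numerically. A minimal single-sample sketch in pure Python (no batch division), comparing (softmax - labels) against central finite differences:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]   # max-subtracted for stability
    s = sum(e)
    return [v / s for v in e]

def ce_loss(z, y):
    # Cross-entropy of logits z against one-hot label index y.
    return -math.log(softmax(z)[y])

z, y = [2.0, -1.0, 0.5], 0
p = softmax(z)
analytic = [p[i] - (1.0 if i == y else 0.0) for i in range(3)]  # softmax - labels
eps = 1e-6
for i in range(3):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    fd = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps)
    assert abs(fd - analytic[i]) < 1e-5
```

The per-batch /batch factor in the commit's Backward is just this gradient averaged over samples.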

Training signal
---------------
Synthetic fallback now generates *learnable* data: 10 class templates
in [-1,1]^784 shared across train/test splits + per-sample Gaussian
noise sigma=1.5. Both splits share templates so generalization is
meaningful.

Measured on a 6000-train / 1000-test synthetic split, batch_size=128,
Adam lr=0.001, 5 epochs:

  Epoch  1/5  loss=0.4183  train_acc=88.47%  (~20s)
  Epoch  2/5  loss=0.0013  train_acc=100.00% (~20s)
  Epoch  3/5  loss=0.0009  train_acc=100.00% (~20s)
  Epoch  4/5  loss=0.0007  train_acc=100.00% (~20s)
  Epoch  5/5  loss=0.0006  train_acc=100.00% (~20s)
  Final test accuracy: 100.00%   Total: 100.7s (net8) / 96.5s (net10)

Fusion probe (post-matmul bias+ReLU, 200 passes, 500-iter warmup
to cross .NET tiered-JIT promotion): 2.4-3.0x speedup fused vs.
np.add + np.maximum. Correctness: bit-exact (max |diff| = 0).

Kernel cache delta: 6 (one per unique expression-dtype combination:
fused bias+ReLU forward, fused bias-only forward, fused ReLU backward,
fused bias+ReLU forward [probe path], and two kept in FusedMlp/NaiveMlp
from the earlier commit). Invariant across epoch / batch iteration
count -- compiled once per process, cache-hit thereafter.
Delegate slots: 0 (pure DSL composition, no captured lambdas).

Files
-----
New:
  examples/NeuralNetwork.NumSharp/MnistMlp/FullyConnectedFused.cs
    -- Dense layer with bias + optional fused activation. Parameters["w"]
       / Parameters["b"] + Grads["w"] / Grads["b"] per the BaseLayer
       contract, so the stock Adam optimizer iterates it without change.
       Three fused NpyIter kernels: bias+ReLU forward, bias-only forward,
       ReLU gradient mask for backward. He-init for ReLU, Xavier for
       linear. Float32 end-to-end.

  examples/NeuralNetwork.NumSharp/MnistMlp/SoftmaxCrossEntropy.cs
    -- Combined softmax + categorical cross-entropy loss. Forward does
       max-subtracted softmax then CE; Backward returns the simplified
       (softmax - labels)/batch form (numerically stable -- avoids
       differentiating log(softmax) on the critical path). Caches
       softmax output between Forward and Backward calls. Ships a
       OneHot helper that handles Byte / Int32 / Int64 label dtypes.

  examples/NeuralNetwork.NumSharp/MnistMlp/MlpTrainer.cs
    -- Explicit training + evaluation loop that uses the existing
       BaseLayer / BaseCost / BaseOptimizer abstractions. Sidesteps
       the built-in NeuralNet.Train which uses
       x[currentIndex, currentIndex + batchSize]  -- that's 2-index
       integer selection in NumSharp, not slicing, and reads the wrong
       data silently. MlpTrainer uses the correct x[$"{start}:{end}"]
       form. Evaluate() runs the same forward pass over the full test
       set and argmax-compares against integer labels.

Modified:
  examples/NeuralNetwork.NumSharp/MnistMlp/MnistLoader.cs
    -- Added LoadFullDataset(dataDir, syntheticTrain, syntheticTest, seed)
       for the canonical train-images/train-labels/t10k-images/t10k-labels
       filename set, plus learnable synthetic fallback
       (SynthesizeSamples with shared class templates across splits).

  examples/NeuralNetwork.NumSharp/MnistMlp/Program.cs
    -- Rewritten to drive the training pipeline end-to-end: load data,
       fusion probe with correctness + speed check, build model (2x
       FullyConnectedFused), train with Adam + SoftmaxCrossEntropy,
       report per-epoch stats + final test accuracy + kernel
       instrumentation.

  examples/NeuralNetwork.NumSharp/Optimizers/Adam.cs
    -- Fixed the ms/vs zero-init. The existing code had the init paths
       commented out with //ToDo: np.full, so layer.Parameters["w"]
       threw KeyNotFoundException on the first step. Now initializes
       via np.zeros(param.Shape, param.dtype).
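For context, the zero-init the fix restores is the standard first step of Adam — a minimal dictionary-based sketch (the `adam_step` helper and parameter names are illustrative, not the Adam.cs API):

```python
import math

def adam_step(params, grads, ms, vs, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Moment buffers must start at zero on first use -- the bug fixed in
    # Adam.cs was exactly this missing init (commented-out //ToDo: np.full).
    for k in params:
        ms[k] = b1 * ms.get(k, 0.0) + (1 - b1) * grads[k]
        vs[k] = b2 * vs.get(k, 0.0) + (1 - b2) * grads[k] ** 2
        mhat = ms[k] / (1 - b1 ** t)       # bias correction
        vhat = vs[k] / (1 - b2 ** t)
        params[k] -= lr * mhat / (math.sqrt(vhat) + eps)

params, grads, ms, vs = {"w": 1.0}, {"w": 1.0}, {}, {}
adam_step(params, grads, ms, vs, t=1)
# First step with bias correction moves w by ~lr regardless of grad scale.
assert abs(params["w"] - 0.999) < 1e-6
```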

Audit notes (not changed in this commit)
----------------------------------------
Other components in the example project are stubbed-out with //ToDo:
markers:
  - Softmax.Forward and Sigmoid.Forward have empty bodies.
  - CategoricalCrossentropy doesn't clip predictions and its Backward
    formula assumes softmax has already been applied (it hasn't). Uses
    np.log(preds) with no epsilon -- div-by-zero on saturation.
  - Accuacy.Calculate (note misspelling) calls np.argmax(preds) without
    axis, so it returns a scalar not a per-row argmax. Useless for
    batched accuracy.
  - NeuralNet.Train uses x[i, j] (2-index integer selection) where
    x[$"{i}:{j}"] (slice) was intended -- training on the wrong data.

The new code bypasses each of these with its own correctly-implemented
path. If and when they get fixed in place, callers can migrate.

Build / test: 0 warnings, 0 errors on net8.0 and net10.0; full
NumSharp.UnitTest (6446 tests excluding OpenBugs/HighMemory) passes
with the Adam fix applied.
…ugs])

Coverage for repeat, roll, stack, expand_dims, squeeze, swapaxes, moveaxis,
atleast_{1,2,3}d with F-contig inputs.

Passing parity (18 tests)
- Repeat without axis: flattens to 1-D as NumPy does; values verified.
- Roll no axis + roll with axis=0 on F(4,3): F-contig preserved, values match.
- Stack([F,F]): (2,4,3), values correct.
- expand_dims(F, 0/1/2): F-contig preserved for all three insertion points.
- swapaxes(C(3,4), 0, 1) → (4,3) F-contig (NumPy's stride-swap).
- swapaxes(F(4,3), 0, 1) → (3,4) C-contig (mirror of above).
- atleast_1d(scalar), atleast_2d(1D) → both-contig trivially.
- atleast_3d(F(4,3)) → (4,3,1) F-contig preserved.
- moveaxis(F(4,3), 0, -1) → (3,4) C-contig (2-D transpose).

Flagged [OpenBugs] (2 tests)
- Repeat_FContig_Axis0_ApiGap: np.repeat(a, n, axis=...) not supported by
  NumSharp — only the axis-less flatten+repeat form. Not an F-order bug
  per se; an API surface gap vs NumPy.
- Squeeze_FContigWithUnitDim_PreservesFContig: NumPy squeeze(F(2,1,3)) →
  (2,3) F-contig. NumSharp produces C-contig — squeeze rebuilds the shape
  without carrying F-strides through.
…enBugs])

Documents three concrete .npy-format gaps vs NumPy:

Passing (1 test)
- NpSave_FContig_RoundTrip_Values_Preserved: Values survive save→load for
  F-contig input. NumSharp casts NDArray to Array via ToMuliDimArray<T>,
  producing a C-order copy, so save writes C-order bytes + a
  "fortran_order: False" header. Round-trip is internally consistent on
  values (but not on layout — see below).

Flagged [OpenBugs] (3 tests)
- NpSave_FContig_Header_ContainsFortranOrderTrue: np.save.cs:172 hardcodes
  "'fortran_order': False" in the header, regardless of the NDArray's
  actual layout. NumPy writes "True" when the source is F-contig. This is
  a round-trip info loss for NumPy interop.
- NpLoad_NumPyFortranOrderTrue_DoesNotThrow: np.load.cs:322 explicitly
  throws Exception when it reads "'fortran_order': True" in a .npy header.
  NumPy-generated F-contig .npy files can't be loaded by NumSharp at all.
- NpSave_FContig_RoundTrip_PreservesFContigFlag: End-to-end round trip
  always yields C-contig output, never F-contig (consequence of hardcoded
  header plus C-order byte writing).

Coverage for np.around and np.round_ with F-contig inputs.

Passing (3 tests)
- Around/Round values match NumPy for both decimals=1 and decimals=2
  (banker's rounding: 1.345 → 1.34).
- Both functions round correctly regardless of input layout.
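The banker's-rounding baseline is easy to reproduce with Python's built-in round(), which uses the same round-half-to-even rule as np.around:

```python
# Round-half-to-even ("banker's rounding"): ties go to the nearest even digit.
assert round(0.5) == 0
assert round(1.5) == 2
assert round(2.5) == 2
# 1.345 is stored as the nearest binary double (slightly below 1.345),
# so rounding to 2 decimals yields 1.34 -- the value the parity tests expect.
assert round(1.345, 2) == 1.34
```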

Flagged [OpenBugs] (3 tests)
- Around_FContig2D_PreservesFContig: np.around is element-wise per NumPy,
  so its output should preserve F-contig layout. NumSharp's around doesn't
  route through the element-wise dispatcher's F-preservation helper —
  output flips to C-contig.
- Round_FContig2D_PreservesFContig: Same gap; np.round_ is an alias of
  around with identical dispatch path.
- Around_FContig3D_PreservesFContig: Confirmed the gap extends to 3-D
  where F-strides aren't trivially both-contig.
…Bugs])

Coverage for the Decimal dtype (non-SIMD, scalar-full dispatcher path).
Confirms F-contig layout preservation for all element-wise operations
and a matching scalar reduction.

Passing (9 tests)
- Binary F+F, unary negate/abs, comparison F>F, scalar multiply F*k —
  all preserve F-contig layout on the 2-D Decimal path.
- astype(F double, decimal) with default 'K' order preserves F-contig.
- 3-D F+F and -F (unary) preserve F-contig through the general
  coordinate-based iteration path for non-SIMD dtypes.
- Sum without axis returns correct scalar (no layout involved).

Flagged [OpenBugs] (1 test)
- Decimal_FContig3D_SumKeepDims_PreservesFContig: Same gap flagged in
  Section 41 for double, now confirmed for Decimal — 3-D axis reductions
  with keepdims=True produce C-contig output even when the Decimal
  scalar-full path is taken.

Net effect: the Decimal scalar-full dispatcher path is on par with the
SIMD path for F-contig preservation in element-wise operations — the gap
is isolated to axis-reduction dispatchers.

Coverage for layout flag computation and slicing under unusual shapes.

Passing (11 tests)
- Empty arrays (0,3) and (3,0) F-order: both C-contig and F-contig flags
  are True (NumPy convention: any dim=0 → both flags True).
- 0-D scalar: both flags True, ndim=0.
- Size-1 dim in any of three positions (leading/middle/trailing) of a 3-D
  F-contig: stays strictly F-contig (NOT automatically both-contig —
  strides still follow F pattern, which is NOT a C pattern here).
- C(2,1,3) vs F(2,1,3): different strides, different flags — confirms the
  flag computation treats C/F truly distinctly under unit dims.
- 6-D F-contig detection: the ArrayFlags cache handles high-dim shapes.
- Column slice F[:, lo:hi] preserves F-contig (the inner stride is still
  the base dtype size; outer stride still the full column span).
- Row slice F[lo:hi, :] and strided slice F[::2, :] both become neither
  C- nor F-contig, matching NumPy exactly.

Flagged [OpenBugs] (1 test)
- HighDim_6D_FContig_Sum_Axis0_KeepDims_PreservesFContig: Same axis
  reduction gap as Section 41 — confirms the bug isn't ndim-bounded;
  any F-contig axis reduction loses the layout regardless of rank.
…penBugs])

Isolates the conditions that trigger the pre-existing SetIndicesND
assertion (src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Setter.cs:552:
"dstOffsets.size == values.size"). Provides a minimal reproducing matrix
for the eventual fix.

Passing (2 tests)
- FancyWrite_1D_ScalarRHS_Works: 1-D target + scalar → ndsCount == ndim,
  so dstOffsets.size equals selected-element count; assertion holds.
- FancyWrite_1D_ArrayRHS_MatchingSize_Works: 1-D target + matching array
  RHS — same as above, trivially matches.

Flagged [OpenBugs] (3 tests)
- FancyWrite_2D_CContig_ScalarRHS_Crashes: 2-D C-contig target + scalar.
  dstOffsets.size=2 (selected rows) vs values.size=8 (elements). Assert
  fires. Confirmed NOT F-order specific — layout is orthogonal.
- FancyWrite_2D_CContig_MatchingArrayRHS_Crashes: Same target + matching
  (2,4) value array. Still dstOffsets.size=2 vs values.size=8 → same
  assertion failure. The assert checks size match, not broadcast
  compatibility.
- FancyWrite_2D_FContig_ScalarRHS_Crashes: Direct F-contig variant of
  the first case, confirming layout-orthogonality.

Net signal: three layers of the same bug — scalar RHS, matching array
RHS, and F-contig layout — all trigger the identical SetIndicesND assert.
The fix should compare dstOffsets.size to the broadcast-target element
count, not values.size.
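The expected semantics the assert should encode match NumPy's fancy-index write, where the RHS broadcasts to the full selected region — a quick NumPy illustration of both crashing cases working as intended:

```python
import numpy as np

a = np.zeros((4, 4))
rows = [0, 2]
# Scalar RHS broadcasts to every element of the selected rows: 2 * 4 = 8 targets.
a[rows] = 7.0
assert a[0].tolist() == [7.0] * 4 and a[2].tolist() == [7.0] * 4
# Matching-shape (2,4) array RHS: one value per selected element.
a[rows] = np.arange(8).reshape(2, 4)
assert a[0].tolist() == [0.0, 1.0, 2.0, 3.0]
assert a[2].tolist() == [4.0, 5.0, 6.0, 7.0]
```

In both cases the broadcast-target element count is 8, which is the number the assert should compare against — not the pre-broadcast values.size.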
…re np.dot

Profiled the 100 s / 5 epoch training loop and found that backward was 98.4%
of the total time. The two np.dot calls in FullyConnectedFused.Backward
(W-grad and input-grad) both feed a .transpose() view into np.dot, and
NumSharp's np.dot falls into a ~100x-slower generic path when an operand
is non-contiguous.

Measured np.dot cost on the layer-1 backward shapes:
  np.dot(x.T , grad)  = (784,128)*(128,128)  W-grad     : 240.35 ms
  np.dot(grad , W.T)  = (128,128)*(128,784)  in-grad    : 226.52 ms
  np.dot(contig_x , grad)   W-grad   (same, contig)    :   2.49 ms
  np.dot(grad , contig_W)   in-grad  (same, contig)    :   2.28 ms

So a 400 KB .copy() of the transposed view ahead of each dot brings
layer-1 backward from ~467 ms to ~5 ms. Forward+bias+ReLU etc. were never
the bottleneck.
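The NumPy analogue of the workaround, using the same layer-1 backward shapes (np.ascontiguousarray plays the role of the 400 KB .copy(); no timing claims here, just the layout flip):

```python
import numpy as np

x = np.ones((128, 784), dtype=np.float32)
grad = np.ones((128, 128), dtype=np.float32)

assert not x.T.flags['C_CONTIGUOUS']       # transposed view is strided
xc = np.ascontiguousarray(x.T)             # one small copy...
assert xc.flags['C_CONTIGUOUS']            # ...restores the contiguous fast path
assert np.allclose(xc @ grad, x.T @ grad)  # identical result either way
```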

Full training (6000 synthetic / 1000 test, batch=128, 5 epochs, Adam lr=1e-3):

  Before                                   After
  -------------------------------------    --------------------------------------
  Epoch  1/5   20864 ms   loss=0.4183      Epoch  1/5    702 ms   loss=0.4183
  Epoch  2/5   20310 ms                    Epoch  2/5    632 ms
  ...                                      ...
  Total       100.7 s                      Total          3.2 s
  per-batch    475.4 ms                    per-batch     13.67 ms

Identical loss trajectory and final test accuracy (100.00%). The kernel
cache delta, fusion probe correctness + speedup, and delegate-slot count
are all unchanged — the speedup comes purely from avoiding np.dot's slow
non-contiguous path.

Post-fix epoch breakdown (629 ms / 46 batches / ~13.7 ms per batch):
  forward  :  20.3%
  loss     :   1.5%
  backward :  52.5%   (down from 98.4% pre-fix; both dots now on fast path)
  optimizer:  25.8%

Follow-up (not in this commit, to avoid scope creep): NumSharp's np.dot
should grow a BLAS-style gemm-with-transpose-flags fast path. The ~100x
gap between strided and contiguous operands is a general perf cliff —
this workaround is local to the example project; the underlying issue
remains in the core library.

net8.0: 100.7 s -> 3.2 s
net10.0: 96.5 s -> 2.8 s

Full NumSharp.UnitTest regression (6502 tests excluding OpenBugs/HighMemory)
remains green.
…Converts integration

Adds `NumSharp.Char8` — a [StructLayout(LayoutKind.Sequential, Size=1)] readonly struct representing a single byte as a character. Equivalent to NumPy's `dtype('S1')` / `numpy.bytes_` of length 1, and to a Python `bytes` object of length 1. Interoperable with C#'s `byte`, `char` (Latin-1), and `string`.

Motivation
----------
NumSharp previously used .NET's 2-byte `char` as its 16-bit character dtype. NumPy has no native 2-byte char — its nearest equivalent is `dtype('S1')` which is 1-byte. The former `np.frombuffer(buffer, "S1")` path in NumSharp routed "S1" through `NPTypeCode.Char` (2 bytes), producing a buffer-overread bug (outer reading N elements × 1 byte, inner `UnmanagedMemoryBlock<char>` allocating N × 2 bytes). This commit introduces Char8 as the correct 1-byte analogue so future NumSharp code can expose a NumPy-compatible S1 dtype.
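NumPy's side of the parity target, for reference — 'S1' is one byte per element, so frombuffer over N bytes yields N one-byte elements, with no room for a 2-byte char:

```python
import numpy as np

buf = b"abc"
a = np.frombuffer(buf, dtype="S1")   # NumPy's 1-byte character dtype
assert a.itemsize == 1               # one byte per element -- Char8's layout
assert a.shape == (3,)
assert a.tolist() == [b"a", b"b", b"c"]
# A 2-byte char dtype would need 2 * N bytes for the same N elements --
# the size mismatch behind the former buffer-overread.
```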

Char8 is intentionally standalone — not an NPTypeCode enum value yet. A full enum integration would touch ~50+ switch statements across DefaultEngine, ILKernelGenerator, NpyIter, casting, and Converts — that refactor is a separate issue. This commit lays the type foundation with parallel API coverage and is ready to be wired into NPTypeCode when that work starts.

Port source & layout
--------------------
Char8 is adapted from System.Char (dotnet/runtime main, src/libraries/System.Private.CoreLib/src/System/Char.cs). That file and its dependencies were fetched into src/dotnet/ as a reference library:

  System/Char.cs (2,066 lines)                       - primary source-of-truth; Latin1CharInfo table copied verbatim
  System/CharEnumerator.cs (55 lines)                - companion
  System/IUtfChar.cs (35 lines)                      - internal UTF-abstraction interface
  System/Globalization/CharUnicodeInfo.cs (542)      - Unicode category/numeric/surrogate constants
  System/Globalization/UnicodeCategory.cs (39)       - enum
  System/Globalization/GlobalizationMode.cs (99)     - mode flags
  System/Globalization/TextInfo.cs (844)             - culture-aware ToUpper/ToLower (not needed for Char8)
  System/Text/Rune.cs (1,564)                        - Unicode scalar type
  System/Text/UnicodeUtility.cs (185)                - IsValidUnicodeScalar etc.
  System/Text/UnicodeDebug.cs (75)                   - debug helpers
  System/Text/Ascii.cs + 6 partials (3,937)          - ASCII operations
  System/Text/Latin1Utility.cs + Helpers.cs (1,228)  - Latin-1 narrow/widen — directly applicable to Char8<->char
  System/Text/Unicode/Utf8Utility.cs (296)           - UTF-8 encoding
  System/Text/Unicode/Utf16Utility.cs (314)          - UTF-16 encoding
  Common/src/System/HexConverter.cs (616)            - hex digit helpers
  System/Number.Parsing.cs (1,505)                   - number parsing (for ThrowOverflowException<T>)

src/dotnet/INDEX.md updated with:
- New "Primitive Types (Char family)" section listing Char8 dependencies
- New "Text: ASCII / Latin-1 / Unicode / Rune" inventory section
- New "Parsing & Conversion Helpers" section (HexConverter, Number.Parsing)
- "Key APIs to Port for Char8" — detailed API surface mapping Char -> Char8
- "Char8 Port Strategy" — 5-phase plan
- "Transitive Dependencies NOT Fetched" — 10 deep runtime internals intentionally omitted (PackedSpanHelpers, AppContextConfigHelper, CultureData, NumberBuffer, etc.) with stub guidance

Char8 implementation — 5 partial files (~1,450 lines)
-----------------------------------------------------

src/NumSharp.Core/Primitives/Char8.cs (465 lines, core)
  - [StructLayout(LayoutKind.Sequential, Size=1)] readonly struct with single byte m_value field. Verified Unsafe.SizeOf<Char8> == 1.
  - Constants: MaxValue (0xFF), MinValue (0x00), Latin1CharInfo[256] table (verbatim from Char.cs), flag bits
  - Constructors: Char8(byte), Char8(char) throws on > 0xFF
  - Implicit widening: Char8 -> byte, int, uint, char (Latin-1 mapping: 0xE9 -> U+00E9)
  - Explicit narrowing: char -> Char8 (throws on > 0xFF), int -> Char8 (throws on outside [0, 255])
  - Unchecked variants: FromCharTruncating, FromInt32Truncating
  - IComparable, IComparable<Char8>, IEquatable<Char8>; GetHashCode returns m_value directly
  - ToString: length-1 string via Latin-1 decode
  - Parse / TryParse: single-char string only, throws FormatException on empty, multi-char, or non-Latin-1
  - Classification — ASCII strict (NumPy parity, diverges from System.Char):
    IsLetter/IsAlpha, IsDigit, IsUpper, IsLower, IsWhiteSpace/IsSpace, IsLetterOrDigit/IsAlnum, IsAscii/IsAsciiChar, IsPrintable, IsControl (includes C1 0x80..0x9F for Char.cs compat)
  - Classification — Latin-1 (Char.cs heritage): IsLetterLatin1, IsUpperLatin1, IsLowerLatin1, IsWhiteSpaceLatin1, IsPunctuation, IsSeparator, IsSymbol, IsNumber, GetUnicodeCategory — all use Latin1CharInfo table directly
  - GetNumericValue: -1.0 for non-digits, 0..9 for 0..9
  - Surrogate predicates: always false (a single byte cannot encode a surrogate)
  - Case conversion:
    * ToUpper/ToLower/ToUpperInvariant/ToLowerInvariant — ASCII bit-flip only (NumPy parity: 0xE9 unchanged)
    * ToUpperLatin1/ToLowerLatin1 — full Latin-1 fold (0xE9 <-> 0xC9, handles 0xDF sharp-s and 0xFF no-fold edge cases)
  - Operators: ==, !=, <, >, <=, >=, +, -, *, /, %, &, |, ^, ~, <<, >>, ++, --, unary + / - — all wrap at byte boundary (verified 0xFF + 1 == 0x00)
  - IConvertible: TypeCode.Byte, full implementation except ToDateTime; ToSByte throws OverflowException for values > 127
  - IFormattable, ISpanFormattable: TryFormat writes 1 char
  - IUtfChar equivalent (public static): CastFrom(byte/char/int/uint/ulong), CastToUInt32
  - Binary read/write: TryRead/WriteLittleEndian/BigEndian, GetShortestBitLength, GetByteCount always 1
  - Bit ops: Abs, Max, Min, IsZero, IsEvenInteger, IsOddInteger, IsPow2, Log2 (8-bit), LeadingZeroCount (returns 8 for zero), TrailingZeroCount (returns 8 for zero), PopCount, RotateLeft, RotateRight (8-bit width)
  - String interop: FromStringAscii/Latin1/Utf8, ToStringAscii/Latin1/Utf8, FromBytes, ToBytes — matches Python s.encode('ascii')/'latin-1')/'utf-8') semantics
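Python's bytes type exhibits the same ASCII-strict / Latin-1 split that Char8 mirrors — a few oracle checks, runnable as-is:

```python
# Python bytes case methods are ASCII-only: non-ASCII Latin-1 bytes pass
# through unchanged, which is the parity target for Char8.ToUpper/ToLower.
assert b"\xe9".upper() == b"\xe9"        # 0xE9 (e-acute) stays unchanged
assert b"e".upper() == b"E"
# Latin-1 widen/narrow is a 1:1 byte <-> codepoint map, like Char8 <-> char:
assert "\xe9".encode("latin-1") == b"\xe9"
assert b"\xe9".decode("latin-1") == "\xe9"
# Arithmetic wraps at the byte boundary, matching the operator contract above:
assert (0xFF + 1) & 0xFF == 0x00
```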

Char8.Operators.cs (175 lines, mixed-type ops)
  - Char8 <-> char/byte/int comparison operators in both directions
  - Arithmetic with int and byte (wraps at byte boundary)
  - Non-boxing Equals overloads for char, byte, int
  - No-throw conversions: TryFromChar, TryFromInt32
  - Deconstruct(out byte)
  - Span reinterpret helpers (zero-copy via MemoryMarshal.Cast): AsBytes, AsChar8s
  - Formatting: ToHex "0xNN", ToEscaped Python-style (\n, \r, \t, \0, backslash, single-quote, double-quote, \xNN)

Char8.Conversions.cs (225 lines, dtype interop)
  - Instance To*: ToBoolean, ToByte, ToSByte, ToInt16/32/64, ToUInt16/32/64, ToChar, ToSingle, ToDouble, ToDecimal
  - Factory From* (throwing): Boolean, Byte, SByte, Int16/32/64, UInt16/32/64, Char, Single, Double, Decimal
  - Saturating: FromInt32Saturating, FromInt64Saturating, FromDoubleSaturating (NaN->0, infinity->0/255)
  - Truncating: FromInt16/UInt16/UInt32/Int64/UInt64Truncating
  - Bulk array conversions: ToBooleanArray/ToInt16Array/ToInt32Array/ToInt64Array/ToSingleArray/ToDoubleArray/ToCharArray/FromInt32Array/FromDoubleArray

Char8.Spans.cs (185 lines, span primitives + UTF-8 classification)
  - Char8SpanExtensions static class with ReadOnlySpan<Char8> extension methods:
    Search: IndexOf, LastIndexOf, Contains, IndexOfAny(2/3/span), Count
    Equality: SequenceEqual, EqualsIgnoreCaseAscii, StartsWith, EndsWith, CompareTo
    String interop (no materialization): EqualsString, StartsWithString, EndsWithString
  - UTF-8 byte classification: IsUtf8SingleByte (0x00-0x7F), IsUtf8ContinuationByte (0x80-0xBF), IsUtf8LeadByte (0xC2-0xF4), IsUtf8Invalid; GetUtf8SequenceLength returns 1/2/3/4 for valid lead bytes, 0 otherwise

Char8.PyBytes.cs (400 lines, Python bytes array methods)
  Each method mirrors bytes.xxx(...) with full Python 3 parity:
  - Strip/LStrip/RStrip — whitespace or custom char set
  - Split (whitespace), Split(sep), RSplit, SplitLines (bytes-only: \n, \r, \r\n — NOT \v, \f, \x1c-\x1e), Partition, RPartition
  - Join
  - Replace — handles empty pattern (inserts new between every byte), count parameter, overlap prevention
  - Case: Upper, Lower, SwapCase, Capitalize, Title
  - Padding: LJust, RJust, Center (CPython formula: extra padding left when pad and width are both odd), ZFill (preserves leading sign byte)
  - Array predicates: IsDigits, IsAlphas, IsAlnums, IsSpaces, IsUppers (requires at least one cased byte), IsLowers, IsTitles (title case alternation), IsAsciis, IsPrintables
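The CPython center formula referenced above can be verified directly against bytes.center (the `center_formula` helper below is ours, written from that formula):

```python
def center_formula(s: bytes, width: int, fill: bytes = b" ") -> bytes:
    # CPython's bytes.center: extra pad goes LEFT only when pad and width
    # are both odd (left = pad//2 + (pad & width & 1)).
    pad = width - len(s)
    if pad <= 0:
        return s
    left = pad // 2 + (pad & width & 1)
    return fill * left + s + fill * (pad - left)

# Cross-check the formula against CPython's own implementation.
for s in (b"", b"ab", b"abc"):
    for width in range(8):
        assert center_formula(s, width) == s.center(width)
```

Plain floor division (the initial, buggy impl) would put the extra fill byte on the right instead.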

src/NumSharp.Core/Utilities/Converts.Char8.cs (324 lines, Converts integration)
  Parallel to Converts.Native.cs for all 12 NumSharp dtypes. Semantics match existing Converts.* primitives (throw on overflow/NaN):
  - Char8 -> X primitives: ToBoolean, ToByte, ToSByte (>127 throws), ToChar (Latin-1 widen), ToInt16/32/64, ToUInt16/32/64, ToSingle, ToDouble, ToDecimal, ToString (length-1 Latin-1 string)
  - X -> Char8 primitives: ToChar8(bool/byte/sbyte/char/short/ushort/int/uint/long/ulong/float/double/decimal/Char8/string) — all throw OverflowException for out-of-range
  - Object / IConvertible dispatch: ToChar8(object), ToChar8(object, provider) — pattern-match on IConvertible.GetTypeCode()
  - Generic dispatcher: ToChar8<T>(T value) where T : struct — switches on InfoOf<T>.NPTypeCode with Unsafe.As<T, X>
  - Typed reverse: ToObject(Char8, NPTypeCode) returns boxed target primitive
  - Bulk arrays: ToChar8Array, ToByteArray, ToInt32Array, ToDoubleArray, ToChar8ArrayFromInt32, ToChar8ArrayFromDouble

Bugs caught by battletest
-------------------------
Three parity bugs surfaced and fixed during Python bytes oracle comparison:

1. Count with empty pattern — Python returns len(s) + 1 (the empty pattern "occurs" between every byte and at both ends). The initial impl returned 0. Fixed in Char8.Spans.cs:Count(ReadOnlySpan<Char8>, ReadOnlySpan<Char8>).

2. Center asymmetric padding — when width - len is odd, Python puts extra padding on the LEFT via CPython formula left = pad/2 + (pad & width & 1), which gives extra left only when both pad and width are odd. Initial impl used plain floor division, producing extra on the right. Fixed in Char8.PyBytes.cs:Center.

3. SplitLines too permissive — my initial implementation recognized \v, \f, \x1c, \x1d, \x1e as line separators (per str.splitlines). But bytes.splitlines() only recognizes \n, \r, and \r\n. Fixed in Char8.PyBytes.cs:SplitLines. Verified: b"a\x0bb".splitlines() == [b'a\x0bb'] in Python, now C# agrees.
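The Python oracle behavior behind bugs 1 and 3, runnable as-is:

```python
# bytes.splitlines recognizes only \n, \r, \r\n -- unlike str.splitlines,
# which additionally breaks on \v, \f, \x1c-\x1e, and more.
assert b"a\x0bb".splitlines() == [b"a\x0bb"]     # \v is NOT a line break for bytes
assert "a\x0bb".splitlines() == ["a", "b"]       # ...but it IS for str
assert b"a\r\nb\rc\nd".splitlines() == [b"a", b"b", b"c", b"d"]
# Empty-pattern count: the empty bytes "occurs" between every byte and at both ends.
assert b"abc".count(b"") == len(b"abc") + 1
```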

Battletest coverage
-------------------
Python bytes oracle vs Char8 — diff identical across 250 lines:
- Main: 91 lines (classify, case, strip, split, splitlines, partition, join, replace, count, padding, search, case-predicates)
- Edge: 131 lines (boundary classify at all key bytes; empty-array ops; overlap count/replace; split edges; splitlines edges with \n, \r\n, \r, \x0b, \x0c, \x1c; partition edges; strip custom; join single-elem; padding when input>=width; title edge cases; case predicates)
- Converts: 28 lines (uint8 -> bool/int16/32/64/uint16/32/64/float32/64 for 8 boundary values; X -> uint8 for bool/int/float in-range; boundary truncation 254.9 -> 254)

C#-specific edge verification — 270+ assertions all pass:
- Char8.cs edges (212 assertions): From*/To* overflow (19 throws), boundary OK (8), saturating (9), truncating (4), TryFrom (5), arithmetic wrap + div-by-zero (16), mixed-type comparison (9), UTF-8 classification 0x00-0xFF (24), ToHex/ToEscaped (11), Parse/TryParse (10), span reinterpret with unsafe address verification (6), Deconstruct (1), string interop (6), IConvertible including ToSByte(200) overflow (9), Abs/Max/Min/IsZero/IsEven/IsOdd/IsPow2/Log2/LeadingZeroCount/TrailingZeroCount/PopCount/RotateLeft/RotateRight (30), surrogate predicates always false (3), Latin-1 divergence from NumPy (8), Latin-1 case fold with sharp-s/ÿ no-fold (7), EqualsIgnoreCaseAscii (4), instance To* conversions (10), struct layout (3).
- Converts.Char8.cs: generic ToChar8<T> dispatch (5), implicit-conversion interop (2), object dispatch (5), string dispatch (1), overflow throws (8), round-trip for 0/1/127/128/255 (5).

Test suite
----------
NumSharp.UnitTest net10.0 with --filter "TestCategory!=OpenBugs&TestCategory!=HighMemory":
- Before: 6,433 tests passed
- After: 6,502 tests passed (+69 new Char8 cases via source-generated discovery)
- Failed: 0 — zero regressions

Extends MlpTrainer.Train so the test set is evaluated every
min(5, epochs) epochs (plus always on the final epoch). Result record
now carries a List<(int Epoch, float TestAcc)> so callers can inspect
the test-accuracy trajectory, not just the final number.

Also retunes the demo so there's actually a visible loss-reduction
curve over many epochs:
 - Epochs: 5 -> 100.
 - Synthetic class-template noise sigma: 1.5 -> 2.5. The more
   aggressive noise keeps test accuracy below 100% for longer and
   surfaces the real convergence shape instead of instant saturation
   after one epoch.

Observed trajectory on 6000-train / 1000-test (batch=128, Adam
lr=1e-3):
  Epoch   1/100  loss=1.1246  train_acc= 73.08%
  Epoch   2/100  loss=0.0089  train_acc= 99.92%
  Epoch   5/100  loss=0.0021                    test_acc=98.88%
  Epoch  10/100  loss=0.0008                    test_acc=99.22%
  Epoch  25/100  loss=0.0002                    test_acc=99.67%
  Epoch  55/100  loss=0.0000                    test_acc=99.78%
  Epoch 100/100  loss=0.0000                    test_acc=99.89%
  Total training time: ~41 s (net8.0)

Fusion probe, IL-kernel cache delta (6), delegate-slot count (0), and
bit-exact fused-vs-naive correctness are all unchanged.
Audited every class under examples/NeuralNetwork.NumSharp outside the
new MnistMlp/ subtree and fixed or completed the ones that were
incomplete, stubbed-out, or mathematically wrong. Every fix verified
against analytically-computed reference values + finite-difference
gradient checks (29 checks, all passing).

Layers/Activations/Softmax.cs
-----------------------------
  Was: Forward was an empty body; Backward used the sigmoid derivative
       (grad * Output * (1 - Output)) which is wrong for softmax.
  Now: Numerically stable row-wise softmax (per-row max-subtract
       + exp + normalize). Backward implements the correct
       Jacobian-vector product:
         dL/dx = softmax * (grad - sum(grad * softmax, axis=1, keepdims))
       Verified against finite differences on three components.
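The backward formula can be checked against finite differences in plain NumPy (a standalone sketch of the same math, not the C# code in this commit):

```python
import numpy as np

def softmax(x):
    # Row-wise, numerically stable: subtract the per-row max before exp.
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def softmax_backward(grad, out):
    # Jacobian-vector product from the commit message:
    # dL/dx = softmax * (grad - sum(grad * softmax, axis=1, keepdims=True))
    return out * (grad - (grad * out).sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
grad = rng.normal(size=(2, 4))
analytic = softmax_backward(grad, softmax(x))

# Central finite differences of L = sum(grad * softmax(x)) w.r.t. each x[i, j].
eps = 1e-6
fd = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp = x.copy(); xp[i, j] += eps
        xm = x.copy(); xm[i, j] -= eps
        fd[i, j] = ((softmax(xp) * grad).sum() - (softmax(xm) * grad).sum()) / (2 * eps)

max_err = np.abs(analytic - fd).max()
```

The softmax Jacobian J_ij = s_i(delta_ij - s_j) is symmetric, which is why the per-row closed form above equals the full Jacobian-vector product.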

Layers/Activations/Sigmoid.cs
-----------------------------
  Was: Forward was an empty body.
  Now: sigma(x) = 1 / (1 + exp(-x)). Backward was already correct and
       is unchanged. Verified: sigmoid(0)=0.5, sigmoid(+/-10) at
       float32 saturation limits.

Cost/CategoricalCrossentropy.cs
-------------------------------
  Was: No clipping -> log(0) risk when softmax saturates; backward
       formula (output - labels) / output was neither the standalone
       CE derivative -labels/preds nor the combined softmax+CE form
       (preds - labels). And forward used np.mean over ALL elements
       (dividing by batch*classes) instead of the standard per-batch
       reduction.
  Now: Standard per-batch CCE with np.clip(preds, eps, 1-eps):
         Forward:  L = -sum(labels * log(clipped)) / batch
         Backward: dL/dpreds = -labels / clipped / batch
       Verified against analytical loss for a 2-class x 3-sample
       example and the three non-trivial grad components.
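A minimal NumPy sketch of the same forward/backward on illustrative 2-class x 3-sample values (not the repo's test fixtures):

```python
import numpy as np

eps = 1e-7
labels = np.array([[1., 0.], [0., 1.], [1., 0.]])       # 3 samples, 2 classes
preds  = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
batch = preds.shape[0]

clipped = np.clip(preds, eps, 1 - eps)
loss = -np.sum(labels * np.log(clipped)) / batch        # per-batch reduction
grad = -labels / clipped / batch                        # dL/dpreds

# Hand-computed reference: only the true-class log-probs contribute.
expected = -(np.log(0.9) + np.log(0.8) + np.log(0.6)) / 3
```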

Cost/BinaryCrossEntropy.cs
--------------------------
  Was: No clipping; backward formula was correct per-element but
       didn't divide by N to match the mean reduction in forward.
  Now: Standard BCE with np.clip(preds, eps, 1-eps) in both forward
       and backward, and /preds.size to match the mean reduction.
       Verified against analytical loss for a 4-element example.
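The same fix in NumPy terms, on illustrative 4-element values:

```python
import numpy as np

eps = 1e-7
labels = np.array([1., 0., 1., 0.])
preds  = np.array([0.9, 0.2, 0.7, 0.4])

clipped = np.clip(preds, eps, 1 - eps)
loss = -np.mean(labels * np.log(clipped) + (1 - labels) * np.log(1 - clipped))
# Per-element derivative of the summand, divided by N to match the mean:
grad = (-(labels / clipped) + (1 - labels) / (1 - clipped)) / preds.size
```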

Metrics/Accuracy.cs (Accuacy)
-----------------------------
  Was: np.argmax(preds) and np.argmax(labels) called without axis,
       so both collapse to a single scalar — meaningless for batched
       predictions.
  Now: Per-row argmax via axis=1, compare, mean. Returns fraction
       correct in [0, 1]. Class name retains the original
       misspelling "Accuacy" for backward compatibility.
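The difference between the two argmax forms, in NumPy terms:

```python
import numpy as np

preds  = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([[0, 1], [0, 1], [1, 0]])

# Without axis, argmax flattens: one scalar index, not per-row classes.
flat = np.argmax(preds)        # index into the flattened (6,) array

# Per-row argmax + compare + mean, as in the fix:
acc = np.mean(np.argmax(preds, axis=1) == np.argmax(labels, axis=1))
```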

Metrics/BinaryAccuacy.cs
------------------------
  Was: Returns null (stub).
  Now: round(clip(preds, 0, 1)) == labels, mean. Handles the case
       where preds are sigmoid outputs slightly out of [0, 1] due to
       float round-off.

Layers/FullyConnected.cs
------------------------
  Was: No bias term; weight init was np.random.normal(0.5, 1, ...) —
       float64 (wrong dtype) and skewed mean 0.5 (wrong statistics).
  Now: He-normal init for ReLU, Xavier/Glorot init otherwise; bias
       defaults to zeros. Opt-in useBias=true constructor flag. All
       float32. Backward computes Grads["w"], Grads["b"], and
       InputGrad; relies on the stride-aware np.dot that now handles
       transposed views directly.

NeuralNet.cs
------------
  Was: Train() used  x[currentIndex, currentIndex + batchSize]  which
       is 2-index integer element selection in NumSharp, not a slice
       — the loop silently trained on whichever single element was
       at that (row, col). Termination depended on the slice
       returning null, which never happens. Optimizer step counter
       was reset per epoch, breaking Adam's bias-correction schedule.
  Now: Uses string-slicing x[$"{start}:{end}"] for the batch, counts
       batches properly from x.shape[0] / batchSize, and keeps a
       monotonic step counter across the whole run (Adam's bias
       correction terms 1 - beta^t only decay when the step count
       increases monotonically across epochs).
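A NumPy analogue of the indexing distinction (in NumSharp, two integer indices likewise select a single element, while the string form is the slice):

```python
import numpy as np

x = np.arange(12.0).reshape(6, 2)   # 6 samples, 2 features
start, size = 0, 3

elem  = x[start, 1]             # two integer indices: ONE element (the old bug)
batch = x[start:start + size]   # a slice: the intended (3, 2) mini-batch
```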

Optimizers/SGD.cs  (NEW)
------------------------
  Was: BaseOptimizer.Get("sgd") fell through to return null.
  Now: Full SGD optimizer with optional classical momentum
       (heavy-ball form: v <- mu*v - lr*g; param <- param + v) and
       optional inverse-time decay of the learning rate. Verified
       against hand-computed trajectories for both vanilla SGD and
       SGD+momentum.
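The heavy-ball update is small enough to trace by hand; a NumPy sketch with assumed lr/mu values (illustrative, not the repo's defaults):

```python
import numpy as np

def sgd_momentum_step(param, grad, v, lr=0.1, mu=0.9):
    # Heavy-ball form from the commit: v <- mu*v - lr*g; param <- param + v
    v = mu * v - lr * grad
    return param + v, v

w, v = np.array([1.0]), np.array([0.0])
g = np.array([2.0])

# Two identical-gradient steps, hand-computable:
w, v = sgd_momentum_step(w, g, v)   # v = -0.2,  w = 0.8
w, v = sgd_momentum_step(w, g, v)   # v = -0.38, w = 0.42
```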

Optimizers/BaseOptimizer.cs
---------------------------
  Was: Get("sgd") case was empty, returned null.
  Now: Get("sgd") returns a default SGD instance.

Smoke test summary
------------------
A standalone run exercising each fixed component on hand-computed
reference inputs reports 29 passed / 0 failed:
  Softmax forward (3 shape/value checks + 1 exact check) + backward
    (3 finite-difference components against analytical)
  Sigmoid (3 points along the activation curve)
  CCE (loss + 3 backward components)
  BCE (loss)
  Accuacy, BinaryAccuacy (per-row argmax + round)
  FullyConnected (shape checks for forward/backward with bias)
  SGD vanilla (w and b deltas) + SGD+momentum (two-step trajectory)
  BaseOptimizer.Get("sgd"), Get("adam")

Full NumSharp.UnitTest regression (6518 tests excluding OpenBugs/
HighMemory) remains green. End-to-end MnistMlp demo runs 100 epochs
in ~42s, final test accuracy 99.89% — unchanged from before the
infra fixes since that demo uses its own FusedActivation path.
Add a comprehensive NDArray reference page (docs/website-src/docs/ndarray.md) for the website. The new document provides a quick tour of NumSharp's NDArray: anatomy (Storage, Shape, TensorEngine), creation helpers, core properties, indexing and slicing (including Python slice syntax), views vs copies, operators and quirks, dtype conversion, 0-d scalars, element access methods, iteration, common patterns (reshape/ravel/transpose), generic NDArray<T>, saving/loading and interop, memory layout, equality semantics, troubleshooting, and an API reference. References to related docs (dtypes, broadcasting, exceptions, compliance) are included.
Eliminates the ~100x slowdown on transposed/sliced matmul inputs by
making every dtype take a stride-native code path. The MLP training
workaround (copying transposed views before np.dot) is dropped: the
kernel now absorbs arbitrary strides directly.

BACKGROUND

np.dot and np.matmul dispatch through DefaultEngine.MultiplyMatrix.
Previously:
  - float/double SIMD path (SimdMatMul.MatMulFloat / ILKernel double)
    REQUIRED contiguous inputs — any stride mismatch rejected it.
  - Fallback chain MatMulGeneric → MatMulCore → MatMulSameType /
    MatMulMixedType. The mixed-type path used left.GetValue(coords) /
    right.GetValue(coords) per element with boxing + Convert.ToDouble,
    giving the 240 ms vs 2.5 ms slowdown documented in the
    FullyConnectedFused backward comment (commit af8a4ad).

CHANGES

1. Stride-aware SIMD GEMM for float / double (BLIS-style GEBP)
   - New SimdMatMul.Strided.cs: generalizes MatMulFloat to accept
     (aStride0, aStride1, bStride0, bStride1). The 8x16 Vector256 FMA
     micro-kernel is stride-agnostic (reads from packed buffers);
     stride variation is absorbed by new packers:
       PackAPanelsStrided - fast path for aStride0==1 (transposed-
         contiguous A) does Vector256.Load of 8 rows per k; general
         scalar path otherwise.
       PackBPanelsStrided - fast path for bStride1==1 (row-contiguous
         B, same as the original path) uses 2xVector256.Load per k;
         bStride0==1 (transposed-contiguous B) reads K contiguous
         floats per column and scatter-stores; fully general scalar
         fallback otherwise.
     Contiguous inputs are detected and delegated to the existing
     MatMulFloatSimple / MatMulFloatBlocked so no perf regression on
     the already-fast path.
   - New SimdMatMul.Double.cs: stride-aware IKJ SIMD path with
     Vector256<double> (4 FMAs). Replaces the IL-generated contiguous
     double kernel in TryMatMulSimd so transposed double is covered.
   - SimdMatMul promoted to partial class.

2. Stride-native generic kernel for all non-SIMD dtypes
   - New Default.MatMul.Strided.cs:
       MatMulStridedGeneric - entry point; reads Shape.strides and
         Shape.offset, dispatches on (sameType, dtype).
       MatMulStridedSame<T> where T : unmanaged, INumber<T> - JIT
         specializes per type. Branches once on bStride1==1 to give
         the compiler a contig-B inner loop it can auto-vectorize;
         fully scalar path otherwise. Covers byte, int16, uint16,
         int32, uint32, int64, uint64, char, decimal, plus float /
         double when SIMD is disabled.
       MatMulStridedBool - OR-of-ANDs (NumPy bool matmul semantics),
         short-circuits when aik is false.
       MatMulStridedMixed<TResult> - mixed-type path with double
         accumulator and typed pointer reads via ReadAsDouble
         (switch on NPTypeCode). No more GetValue(coords) boxing or
         coord-array allocations in the inner loop.

3. Dispatcher rewrite
   - Default.MatMul.2D2D.cs:
       TryMatMulSimd passes strides + Shape.offset to the new SIMD
         kernels; only requirement is that C is contiguous.
       MultiplyMatrix falls through to MatMulStridedGeneric directly
         — no copies, no MatMulGeneric fallback.
   - Deleted (now dead): MatMulGeneric, MatMulCore<TResult>,
     MatMulSameType<T>, four MatMulContiguous overloads (float,
     double, int, long), MatMulMixedType<TResult>. ~165 lines gone.

4. FullyConnectedFused.cs: the two .copy() workarounds on
   Input.transpose() and W.transpose() are removed, along with the
   10-line apology comment. np.dot now hits the SIMD fast path
   directly on the transposed views.

TEST COVERAGE

New MatMulStridedTests (28 tests):
  - All 4 BLAS transpose cases (NN, NT, TN, TT) x float/double
    x simple/blocked path with bit-exact comparison vs copy+matmul.
  - Per-dtype stride-native coverage: byte, int16, uint16, int32,
    uint32, int64, uint64, char, decimal, bool — TN and NT patterns,
    plus sliced-row patterns exercising bStride1==1 fast branch.
  - Sliced view with Shape.offset > 0 (2D slice) for both float and
    int64 — exercises the offset adjustment in the dispatcher.
  - Mixed-type: int32 @ float32 (transposed) exercises MixedDispatch
    with ReadAsDouble typed reads.
  - MLP-shape regression (784x64 x 64x128 and 64x128 x 128x784) —
    the exact shapes from FullyConnectedFused.Backward.

Full suite: 6710/6710 pass on net8.0 and net10.0 (0 regressions,
11 HighMemory skipped).

PERFORMANCE (stride-native vs copy+matmul)

MLP gradients (float32, shape from FullyConnectedFused):
  inputT(784,64) @ gradPreact(64,128): 1 ms (was 240 ms per the
    comment removed from FullyConnectedFused).
  gradPreact(64,128) @ wT(128,784): 1 ms.

Integer stride-native (150,200) @ (200,150):
  int32: 10 ms stride-native, 11 ms copy+matmul.
  int64: 11 ms stride-native, 11 ms copy+matmul.

Large float blocked path, Lt(400,500) @ L(500,400):
  8 ms stride-native vs 12 ms copy+matmul (skips the copy allocation).

IMPLEMENTATION NOTES

- All paths handle Shape.offset on the base pointer: pointers are
  advanced by shape.offset elements before being passed to the
  kernel, matching the (byte*)arr.Address + offset*dtypesize pattern
  used elsewhere in DefaultEngine.
- Packing cost is O(M*K + K*N) vs matmul's O(M*N*K), so the packer
  overhead is bounded at 1/N + 1/M of total work — negligible for
  any shape that cares about matmul perf.
- The INumber<T> generic kernel is a pattern .NET's JIT recognizes
  and auto-vectorizes for primitive numeric types, so non-SIMD
  dtypes still get implicit vectorization without hand-written
  intrinsics code.
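The addressing the stride-native kernels perform can be modeled in pure Python; `matmul_strided` and its parameter names below are a hypothetical sketch, not the C# kernel:

```python
import numpy as np

def matmul_strided(a_buf, a_off, as0, as1, b_buf, b_off, bs0, bs1, M, K, N):
    # Stride-native GEMM over flat buffers:
    # C[i, j] = sum_k A[a_off + i*as0 + k*as1] * B[b_off + k*bs0 + j*bs1]
    c = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += a_buf[a_off + i * as0 + k * as1] * \
                       b_buf[b_off + k * bs0 + j * bs1]
            c[i, j] = acc
    return c

rng = np.random.default_rng(1)
base = rng.normal(size=(4, 3))
a = base.T                          # transposed view: element strides swap
b = rng.normal(size=(4, 5))

# a[i, k] = base[k, i] = flat[i*1 + k*base.shape[1]], so as0=1, as1=3.
c = matmul_strided(base.ravel(), 0, 1, base.shape[1],
                   b.ravel(), 0, b.shape[1], 1,
                   a.shape[0], a.shape[1], b.shape[1])
```

No copy of the transposed view is needed; the stride swap is absorbed directly by the index arithmetic.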
…erage

Aligned np.tile with NumPy 2.x's _shape_base_impl.py behavior and verified
byte-identical output across 94 cases (34 curated + 60 random fuzzed).

Implementation fixes:
- Dropped `params` from the long[] overload (np.tile.cs:33). The dual
  `params int[]` + `params long[]` overloads caused a compiler ambiguity
  when calling `np.tile(a)` with no reps — which is valid NumPy
  (`np.tile(a, ())` returns a copy, ndmin=0). The int[] overload now
  handles the no-args call cleanly and the long[] overload still accepts
  explicit long[] arrays.

Verified against NumPy's _shape_base_impl.py line-by-line:
- ndmin=d promotion (leading 1s on A.shape when d > A.ndim)
- All-ones shortcut returning a copy with promoted ndim
- d < c.ndim → prepend 1s to reps (reps promoted to A.ndim)
- Zero-size handling (any rep==0 or aShape==0 → empty array of correct
  shape and dtype)
- shape_out = c.shape * tup per axis

NumSharp uses broadcast+copy+reshape where NumPy uses an iterative
reshape(-1,n).repeat(nrep,0) per axis. The two are mathematically
equivalent, but NumSharp materializes the result in a single pass;
output is byte-identical across all tested cases.
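The broadcast+copy+reshape construction, sketched in NumPy for the already-promoted case len(reps) == a.ndim (`tile_via_broadcast` is a hypothetical helper, not the NumSharp code path):

```python
import numpy as np

def tile_via_broadcast(a, reps):
    # (d0, d1) with reps (r0, r1):
    # (1, d0, 1, d1) -> broadcast to (r0, d0, r1, d1) -> reshape (r0*d0, r1*d1)
    expanded, target = [], []
    for r, d in zip(reps, a.shape):
        expanded += [1, d]
        target += [r, d]
    out = np.broadcast_to(a.reshape(expanded), target)
    return np.ascontiguousarray(out).reshape([r * d for r, d in zip(reps, a.shape)])

a2 = np.array([[1, 2], [3, 4]])
a1 = np.arange(3)
```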

Test suite expanded 22 → 37 cases covering:
- Empty reps `np.tile(a)` — copy with original shape, ndim preserved
- Transposed (non-contiguous, strides=[8,24]) input
- Broadcasted input (stride=0) — output is writable even though source isn't
- Sliced view input (`arr[::2]`)
- Zero reps at leading / trailing axis
- A.ndim > len(reps) — reps prepended with 1s (2D-with-scalar-reps,
  3D-with-scalar-reps)
- 4-D with mixed reps (2,1,3,1)
- reps_len > A.ndim — leading size-1 axes prepended to A
- Full dtype coverage for all 12 NumSharp types (byte, short, ushort,
  int, uint, long, ulong, float, double, decimal, bool, char) with
  value verification
- Independence: mutating output doesn't touch source

Results:
- 37/37 tile tests pass on net8.0 + net10.0
- Full test suite regression check: 6748/6748 passing, no regressions
Follow-up to 2f7d4195 (np.tile implementation). Updates the project's
internal docs to reflect that np.tile is now supported, and converts the
pre-existing OpenBugs sentinel into a real (passing) test.

CLAUDE.md
---------
- Missing Functions count: 20 → 19.
- Manipulation row: removed `np.tile` from the missing list.
- Shape Manipulation supported list: added `tile` between `swapaxes` and
  `transpose`.

OrderSupport.OpenBugs.Tests.cs
------------------------------
The Section 30 "Tile_ApiGap" test was a placeholder that asserted
`false.Should().BeTrue("np.tile is not implemented")` under [OpenBugs].
Now that np.tile exists, the test is rewritten as a minimal smoke check
of the canonical NumPy doc example (np.tile([1,2,3], 2) → [1,2,3,1,2,3])
and the [OpenBugs] attribute is removed, so it runs in CI.

Net: +1 passing test from the existing [OpenBugs] count.
…r.Assign to NpyIter.Copy

Adds an NpyIter.Copy(dst, src) umbrella that subsumes the legacy
MultiIterator.Assign -- same broadcast/stride/cross-dtype semantics, but
routed through the SIMD-fast TryCopySameType kernel for matching dtypes
and a new strided-cast helper for cross-dtype. All 11 production call
sites in NumSharp.Core + the Bitmap project are migrated. The legacy
MultiIterator type itself is left in place (still referenced by tests
via AsIterator<T> wrappers in a separate migration phase).

Background
==========
MultiIterator.Assign(lhs, rhs) was the catch-all path for "copy rhs into
lhs with broadcast and stride awareness, casting if dtypes differ." It
worked by constructing two legacy NDIterator<T> instances (one per
operand) and walking them lockstep with per-element delegate dispatch
(MoveNext/MoveNextReference Funcs), reading rhs as its stored dtype and
converting to lhs.TypeCode via Converts.FindConverter on every read.

Performance baseline:
  - Per-element delegate call (Func<T> for MoveNext / Func<bool> for
    HasNext) is ~5-10 ns on its own.
  - The Converts.FindConverter delegate is another per-element call.
  - Coordinate-based offset calculation re-walks the shape every step
    via ValueCoordinatesIncrementor.
  - For matching dtypes this was already wasteful; the existing
    NpyIter.TryCopySameType IL kernel did the same job 3-10x faster.
  - For mismatched dtypes there was no faster path.

Five of the eleven sites already prefixed an
`if (NpyIter.TryCopySameType(...))` guard and only fell through to
MultiIterator.Assign on dtype mismatch -- we just collapse that into one
call now.

NpyIter.Copy(dst, src) -- new umbrella
======================================
Lives next to TryCopySameType in the static NpyIter class
(Backends/Iterators/NpyIter.cs). Strategy:

  1. Same dtype -> TryCopySameType. Existing IL copy kernel covers
     contiguous (memcpy via cpblk) and strided/broadcast (struct-generic
     INumber<T> kernel that the JIT auto-vectorizes per dtype).

  2. Cross dtype -> CopyStridedToStridedWithCast. Reuses the same
     CreateCopyState (broadcast-aware shape resolution + axis
     coalescing) but runs a per-element NpyIterCasting.ConvertValue
     loop. Adds the stride-aware-on-both-sides variant -- the existing
     NpyIterCasting helpers covered (strided->contig) and
     (contig->strided) but not (strided->strided), which is what arises
     when both operands are non-contiguous (broadcast src AND
     transposed/sliced dst).

The cast path is still scalar (one ConvertValue per element) but it
benefits from coalesced-axis iteration (1-D walk on contiguous-pair
inputs). The JIT-friendly INumber<T> SIMD path for matching dtypes was
the bigger common-case win; the cast path is rare in practice and
parity-correct here.

NpyIterCasting.cs -- new helper
===============================
CopyStridedToStridedWithCast: combines the shape-driven coordinate
walking of CopyStridedToContiguousWithCast and
CopyContiguousToStridedWithCast. Walks innermost-axis-first (C-order),
applies element-size multiplication via InfoOf.GetSize on both source
and destination, and accepts stride=0 dims (broadcast) without special-
casing -- they just contribute zero to srcOffset.

Migrated call sites (10 in NumSharp.Core + 1 in Bitmap)
=======================================================

A. UnmanagedStorage.Setters.cs -- 4 sites (lines 316, 382, 432, 483):
   Each was the broadcast/sliced fallback in SetData(NDArray|IArraySlice,
   int[]|long[]). Mechanical 1-line API swap. Comment-only reference at
   line 327 also updated for consistency.

B. UnmanagedStorage.Cloning.cs:378 -- non-contig CloneData() fallback.
   Already had the TryCopySameType guard at line 377; collapsed both
   into one Copy() call. -1 LOC.

C. UnmanagedStorage.cs:1439 -- CopyTo<T>(T*) non-contig path. Wraps the
   destination raw pointer in an UnmanagedStorage and invokes Copy.

D. NDArray.String.cs:91 -- char-array string materialization for
   non-contig sliced storage. Same-dtype char->char hits the SIMD
   TryCopySameType fast path now.

E. np.copyto.cs:29 -- np.copyto's broadcast/cast fallback. Also had a
   TryCopySameType guard; collapsed. -2 LOC.

F. np.concatenate.cs:107 -- per-axis-index slice copy in the concatenate
   inner loop.

G. NDArray.Copy.cs:35 -- F-order copy fallback. Had the TryCopySameType
   guard; collapsed. -1 LOC.

H. NumSharp.Bitmap/np_.extensions.cs:254 -- non-contiguous bitmap
   scanline copy. Required adding InternalsVisibleTo("NumSharp.Bitmap")
   to NumSharp.Core/Assembly/Properties.cs since NpyIter.Copy is
   internal (consistent with the rest of the NpyIter surface -- kept
   internal because NpyIterRef is an internal ref struct). Also fixes
   the pre-existing OpenBugs.Bitmap.cs:48 / Bug 4 ("ToBitmap on
   non-contiguous sliced array fails inside MultiIterator") because
   NpyIter.Copy correctly handles arbitrary stride.

Behavioral divergence found and fixed
=====================================
NDArray.Indexing.Masking.cs:SetBooleanMaskAxis0 had a hidden dependency
on legacy MultiIterator.Assign's BIDIRECTIONAL broadcast (Shape.Broadcast
returns a common shape; both operands stretch to fit). NumPy's np.copyto
(and now NpyIter.Copy) is one-directional: src must be broadcastable to
dst.shape.

The Case6_Assignment_BroadcastValue test exposed this. The NumPy-level
operation was:

    arr2d = np.arange(12).reshape(3, 4)
    arr2d[[True, False, True]] = np.array([[100, 101, 102, 103]])

NumPy resolves this by computing the masked target shape (2, 4) and
broadcasting the (1, 4) value to it. NumSharp iterates row-by-row and
called MultiIterator.Assign(row_of_shape_4, value_of_shape_1_4) per
selected row. Legacy bidirectional broadcast made this work
accidentally; the new strict NpyIter.Copy correctly rejects (1,4) -> (4)
because that is not a valid NumPy broadcast.

Fix: in the per-row branch (the catch-all "Broadcast value to
destination" path), squeeze leading singleton axes from value until
its ndim matches destSlice.ndim. This matches what the NumPy mask-assign
high-level operation produces row-by-row.

Verification
============
- Smoke test (8 cases) compares NpyIter.Copy vs MultiIterator.Assign
  for: same-dtype contig->contig, cross-dtype int64->float64 broadcast,
  cross-dtype int32->float32 with transpose stride, cross-dtype
  double->int truncation, same-dtype broadcast 1D->2D, empty array,
  random parity, broadcast+stride combo. All 8 pass byte-for-byte.

- Full suite (TestCategory!=OpenBugs&TestCategory!=HighMemory) on
  net8.0 + net10.0: 6748/6748 pass, zero failures.
  Baseline before this PR: 6710. Net delta +38 (22 new TileTests in
  prior commit + 1 unmarked Tile_ApiGap + 15 from misc fixes/
  un-skipped tests since baseline measurement).

Performance notes
=================
- Same-dtype callers (90%+ of traffic in practice -- most SetData/
  copyto/concatenate calls are typed assignments where source and dest
  match): no measurable change. Already used the IL copy kernel via the
  existing TryCopySameType guards or now hit it through the umbrella.

- Cross-dtype callers (rare): per-element scalar ConvertValue is on par
  with the legacy delegate-based path -- both pay one virtual call per
  element. The NpyIter coordinate-walking is slightly tighter (single
  stackalloc coords vs ValueCoordinatesIncrementor allocation) but this
  is not on a hot path.

- The MultiIterator.Assign call-graph is now empty in src/. The file
  itself (314 LOC) and the 12 NDIterator.Cast.* partials remain because
  test code still uses the AsIterator<T> extension method (which the
  next migration phase will redirect to an NpyIter wrapper, deleting
  the legacy types entirely).

Risks
=====
- The NumSharp.Bitmap project now depends on NumSharp.Core internals
  via InternalsVisibleTo. Bitmap was already consuming
  UnmanagedStorage/UnmanagedMemoryBlock<byte> directly so the
  internal-types coupling already existed; this just adds NpyIter.Copy.

- One existing call site (SetBooleanMaskAxis0 catch-all) was relying
  on undocumented bidirectional broadcast leniency. The fix makes
  NumSharp's mask-assign more NumPy-aligned (1-directional broadcast
  per copy call) and the singleton-squeeze handles the row-by-row case
  the legacy code was getting "for free" from the bidirectional
  broadcast. Other call sites that went through MultiIterator.Assign
  with intentional bidirectional broadcast would break the same way --
  none surfaced in the 6748-test suite.
Expose the entire NpyIter / NpyIterRef / NpyIterState / NpyExpr system
as a first-class public API so NumSharp consumers can drive the nditer
core, plug custom inner loops, compose expression trees, and implement
custom reduction/axis kernels — matching NumPy's `np.nditer` extensibility.

=== NDArray-first API on NpyIter ===

NpyIter's static helpers (Copy, ReduceBool, TryCopySameType) previously
required consumers to unwrap `.Storage` manually. Added NDArray overloads
that forward to the existing UnmanagedStorage overloads — keeping Storage
as a secondary entry point for low-level code that constructs fresh
buffers from raw pointers (bitmap Scan0, pinned strings, etc.).

Copy's full XML doc now lives on the NDArray overload; the Storage
overload inherits it via <inheritdoc>, making NDArray the primary form.

Migrated 7 in-hand NDArray callsites to drop the `.Storage` wart:
 - np.copyto(NDArray, NDArray)
 - NDArray.copy()
 - np.concatenate (per-axis writeTo/writeFrom inner loop)
 - Default.All / Default.Any impl dispatch (both generic and decimal)

Remaining `.Storage` callsites are legitimately Storage-native (inside
UnmanagedStorage methods, or wrapping raw T* / ArraySlice / bitmap
Scan0 buffers where no NDArray exists).

=== Full public API promotion ===

Every `internal` access modifier in src/NumSharp.Core/Backends/Iterators/
has been removed. Users can now:

1. Call NpyIter's static helpers:
     NpyIter.Copy(ndDst, ndSrc)
     NpyIter.ReduceBool<T, NpyAllKernel<T>>(nd)
     NpyIter.TryCopySameType(ndDst, ndSrc)
     NpyIter.CreateCopyState / CreateReductionState

2. Drive NpyIterRef directly:
     NpyIterRef.New / MultiNew / AdvancedNew
     Iternext, Reset, GotoIterIndex, GotoIndex, GotoMultiIndex,
     GetDataPtrArray, GetInnerStrideArray, GetInnerLoopSizePtr,
     GetIterView, GetValue<T>, SetValue<T>, RemoveAxis,
     RemoveMultiIndex, EnableExternalLoop, Copy, Dispose, RawState

3. Plug custom inner loops via:
     NpyInnerLoopFunc delegate (raw)
     INpyInnerLoop interface (struct-generic, zero-alloc)
     INpyReducingInnerLoop<TAccum> (accumulator-threaded)
     NpyIterInnerLoopFunc / NpyIterNextFunc / NpyIterGetMultiIndexFunc

4. Implement custom reduction/axis kernels via:
     INpyBooleanReductionKernel<T>
     INpyAxisNumericReductionKernel<T>
     INpyAxisSameTypeKernel<T>
     INpyAxisDoubleReductionKernel
     INpyIterKernel

5. Use predefined kernel structs:
     NpyAllKernel<T>, NpyAnyKernel<T>
     NpySumAxisKernel<T>, NpyProdAxisKernel<T>,
     NpyMaxAxisKernel<T>, NpyMinAxisKernel<T>
     CumSumAxisKernel<T>, CumProdAxisKernel<T>
     VarAxisDoubleKernel, StdAxisDoubleKernel
     CountNonZeroKernel<T>  (INpyReducingInnerLoop<long>)

6. Drive NpyAxisIter for axis-scoped ops:
     ExecuteSameType, ReduceDouble, ReduceBool, ReduceNumeric

7. Build expression trees via NpyExpr (Tier 3C custom-op API):
     NpyExpr.Input / Const / Add / Mul / Div / Power / Sqrt / Exp / ...
     expr.Compile(inputTypes, outputType, cacheKey) -> NpyInnerLoopFunc
     Subclass NpyExpr and override EmitScalar / EmitVector /
       SupportsSimd / AppendSignature for custom IL-emitting nodes
     Use NpyExprCompileContext in custom overrides
     InputNode, ConstNode, BinaryNode, UnaryNode, ComparisonNode,
       MinMaxNode, WhereNode, CallNode sealed impls
     DelegateSlots static registry for user-bound delegates

8. Access supporting infrastructure:
     NpyIterState (iterator state, low-level)
     NpyAxisState (axis-iterator state)
     NpyIterCasting (cross-dtype cast helpers)
     NpyIterCoalescing (axis coalescing)
     NpyIterBufferManager (aligned buffer alloc)
     NpyIterPathSelector / NpyIterExecution (path dispatch)
     Low-level pointer helpers on NpyIter:
       CoalesceAxes, UpdateLayoutFlags, IsContiguous, Advance
     Constants: StackAllocThreshold, MaxDims

Only `private` members remain — those are backing fields, private
helpers, and private caches (e.g., DelegateSlots._delegates). These are
standard encapsulation, not API surface.

=== Scope ===

Files promoted (Iterators/ folder, 12 files):
 - NpyIter.cs + .State.cs + .Execution.cs + .Execution.Custom.cs
 - NpyIterKernels.cs, NpyLogicalReductionKernels.cs
 - NpyAxisIter.cs + .State.cs
 - NpyExpr.cs (all expression nodes + compile context + DelegateSlots)
 - NpyIterCasting.cs, NpyIterCoalescing.cs, NpyIterBufferManager.cs

Callsite migrations (5 files):
 - Creation/NDArray.Copy.cs, Creation/np.concatenate.cs
 - Manipulation/np.copyto.cs
 - Backends/Default/Logic/Default.All.cs + Default.Any.cs

=== Verification ===

Full build clean across net8.0 + net10.0.
All 6,748 tests pass on both frameworks (filter: !OpenBugs & !HighMemory).
… count_nonzero, np.where to NpyIter

Continues the legacy iterator migration started in commit 65e64618. Replaces
remaining AsIterator<T> call sites in five production paths with the
NpyIter-based iteration machinery, keeping the full test suite at 6,748
passing on net8.0 and net10.0.

API surface change: elevates NpyIter types (NpyIterRef, NpyIter static class,
NpyIterState, kernel interfaces, delegates, flags, NPY_ORDER/NPY_CASTING,
NpyExpr nodes, INpyInnerLoop / INpyReducingInnerLoop, boolean/axis kernel
interfaces) from internal to public so external callers — including the
downstream NumSharp.Bitmap project — can consume the new iterator without an
InternalsVisibleTo entry. The NumSharp.Bitmap InternalsVisibleTo in
Properties.cs is dropped because NpyIter.Copy is now public.

Adds NDArray-accepting overloads for NpyIter.Copy, NpyIter.TryCopySameType,
and NpyIter.ReduceBool so call sites can pass NDArray directly without going
through .Storage. Call sites in np.copyto, np.concatenate, NDArray.Copy,
Default.All, and Default.Any use the new overload.

Step 5: NaN reductions (~150 LOC saved)
---------------------------------------
New file: src/NumSharp.Core/Backends/Iterators/NpyNanReductionKernels.cs

Adds INpyReducingInnerLoop<TAccum> struct kernels for scalar NaN reductions:
- NanSumFloatKernel / NanSumDoubleKernel         (accum = float/double)
- NanProdFloatKernel / NanProdDoubleKernel       (accum = float/double)
- NanMinFloatKernel / NanMinDoubleKernel         (accum = NanMinMax*Accum)
- NanMaxFloatKernel / NanMaxDoubleKernel         (accum = NanMinMax*Accum)
- NanMeanFloatKernel / NanMeanDoubleKernel       (accum = NanMeanAccumulator, sum+count)
- NanSquaredDeviationFloatKernel / ...Double      (pass 2 of two-pass variance)

Accumulator types:
- NanMeanAccumulator      { double Sum; long Count; }
- NanMinMaxFloatAccumulator  { float  Value; bool Found; }
- NanMinMaxDoubleAccumulator { double Value; bool Found; }

All kernels process one inner-loop chunk at a time using per-operand byte
strides, so they transparently support contiguous, sliced, broadcast, and
transposed layouts without any code branching.

Migrated AsIterator call sites:
- np.nanmean.cs (nanmean_scalar): one-pass sum+count, AsIterator -> ExecuteReducing
- np.nanvar.cs  (nanvar_scalar):  two-pass mean + squared deviation
- np.nanstd.cs  (nanstd_scalar):  two-pass mean + squared deviation + sqrt
- Default.Reduction.Nan.cs (NanReduceScalarFloat, NanReduceScalarDouble):
  Sum/Prod/Min/Max over 12 AsIterator loops collapsed to 8 ExecuteReducing calls

The two-pass variance path preserves numerical behavior of the legacy code
exactly (no Welford switch) — pass 1 accumulates sum/count, pass 2 accumulates
sum of (value - mean)^2 with mean held in the kernel struct's field.
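The two-pass scheme in plain NumPy (a behavioral sketch, not the kernel code):

```python
import numpy as np

x = np.array([1.0, np.nan, 2.0, 4.0])

# Pass 1: sum + count over non-NaN elements -> mean.
mask = ~np.isnan(x)
count = mask.sum()
mean = x[mask].sum() / count

# Pass 2: sum of squared deviations with the pass-1 mean held fixed.
var = ((x[mask] - mean) ** 2).sum() / count
```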

Step 6: count_nonzero + BooleanMask (~40 LOC saved)
---------------------------------------------------
New kernel: CountNonZeroKernel<T> in NpyLogicalReductionKernels.cs
Generic over any unmanaged T, uses EqualityComparer<T>.Default.Equals
(devirtualized by the JIT per struct instantiation) for the != default check.

Migrated:
- Default.NonZero.cs (count_nonzero<T>): strided fallback now uses
  NpyIterRef + ExecuteReducing<CountNonZeroKernel<T>, long>. Contiguous fast
  path untouched.
- Default.BooleanMask.cs (BooleanMaskFallback): two-pass migration.
  Pass 1 counts trues via NpyIter (same kernel reused). Pass 2 gathers via
  a 2-operand NpyIter (arr + mask, NPY_CORDER for lockstep C-order traversal)
  with an accumulator-threaded BooleanMaskGatherKernel that stores the write
  cursor and destination pointer as ref-updated state.

NPY_CORDER is required on the gather pass because boolean indexing is
defined in logical C-order, not memory-efficient iteration order; NPY_KEEPORDER
on a transposed array produces wrong results (Case12_TransposedArray_BooleanMask).
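The C-order requirement in NumPy terms: on a transposed view, logical order and memory order disagree, and boolean indexing follows the logical one:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.T                 # non-contiguous transposed view, shape (3, 2)

# Boolean indexing walks the LOGICAL C-order of t, not its memory order:
picked = t[t >= 2]      # visits t[0,1]=3, t[1,1]=4, t[2,0]=2, t[2,1]=5
```

Walking memory order instead (a's C-order) would yield [2, 3, 4, 5], which is the kind of mismatch the KEEPORDER variant produces.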

Step 7: np.where (3-5x perf win on non-contig path)
---------------------------------------------------
Migrated np.where.cs WhereImpl<T>: the 4-lockstep AsIterator<bool> +
AsIterator<T> x 3 path is replaced with a single 4-operand NpyIter
compiling Where(Input(0), Input(1), Input(2)) into a SIMD-capable IL kernel
via ExecuteExpression. Cache key is $"np.where.{dtype}" so repeated calls hit
the cached compiled kernel.

Also moves the condition-to-bool cast from implicit AsIterator<bool>
per-element casting into an explicit cond.astype(Boolean, copy: false) at
the top of where_internal. This also lets the SIMD fast path (canUseKernel)
handle non-bool conditions, closing a pre-existing behavioral asymmetry.
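The element-wise semantics the compiled kernel implements, including the up-front condition cast, sketched in Python:

```python
def where(cond, x, y):
    # Explicit up-front bool cast of the condition (astype(Boolean,
    # copy: false) in the real code), then one lockstep pass over
    # all three operands.
    cond = [bool(c) for c in cond]
    return [xv if c else yv for c, xv, yv in zip(cond, x, y)]

print(where([1, 0, 2], [10, 20, 30], [-1, -2, -3]))  # [10, -2, 30]
```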

Bug discovered (collected, not fixed here)
-------------------------------------------
NpyIterRef.New(arr) / MultiNew without NpyIterGlobalFlags.EXTERNAL_LOOP
exposes wrong inner-loop counts. Each kernel invocation is called with
count == IterSize but the base data pointer only advances by one element
between calls, so the kernel reads past the end of the array. Workaround:
pass NpyIterGlobalFlags.EXTERNAL_LOOP on every NpyIterRef.New call for bulk
iteration. All migrated call sites above use EXTERNAL_LOOP consistently.

Test impact
-----------
Full suite (TestCategory!=OpenBugs&TestCategory!=HighMemory):
  Before: 6,748 passed
  After:  6,748 passed (no regressions, no new tests added this phase)
Framework coverage: net8.0 + net10.0.
…merator migrated off AsIterator

Step 8 — random sampling and multi-dim array casting
-----------------------------------------------------
np.random.dirichlet:
  When alpha comes in as an NDArray (possibly strided / wrong dtype), the
  foreach + AsIterator<double> copy into a local flat double buffer is
  replaced by an NpyIter.Copy wrapping the destination ArraySlice as an
  UnmanagedStorage (with Shape.Vector(k)). NpyIter.Copy absorbs both the
  source layout and the any-numeric-dtype->double cast in one call.

np.random.multivariate_normal:
  Same pattern as dirichlet — mean.AsIterator<double> copy into meanSlice
  becomes NpyIter.Copy(meanStorage, mean.Storage). The cov copy loop is
  left as GetDouble-per-element for now because it already needs element
  traversal (SVD decomposition consumes cov immediately after).

NDArray.ToMuliDimArray<T>:
  Replaces the AsIterator + ValueCoordinatesIncrementor + per-element
  Array.SetValue loop (one boxed runtime type check per element) with a
  single Storage.ToArray<T>() call followed by Buffer.BlockCopy into the
  multi-dimensional destination array. Both .NET multi-dim arrays and
  NumSharp arrays are row-major (C-order), so the flat buffer lines up
  directly with the destination's linear backing storage.

  Decimal is not a primitive so Buffer.BlockCopy rejects it; that one
  dtype falls back to the coordinate-walk + SetValue path. All 11 other
  supported dtypes now take the bulk-memcpy fast path (expected 5-10x for
  primitives depending on array size, mostly due to eliminating the
  runtime type-check per SetValue).
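Why the single flat copy works: both layouts are row-major, so carving the flat buffer into nested blocks reproduces the multi-dim structure. A Python sketch of the BlockCopy alignment (hypothetical helper name):

```python
def to_multidim(flat, shape):
    # Row-major flat buffer -> nested lists; the analogue of one
    # Buffer.BlockCopy into a C-order multi-dimensional array.
    if len(shape) == 1:
        return list(flat[:shape[0]])
    step = len(flat) // shape[0]
    return [to_multidim(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

print(to_multidim([0, 1, 2, 3, 4, 5], (2, 3)))  # [[0, 1, 2], [3, 4, 5]]
```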

Step 9 — NDArray.GetEnumerator
------------------------------
The 12-dtype switch over `new NDIterator<T>(this, false).GetEnumerator()`
collapses to a single `_iter1D<T>()` helper that materializes via
Storage.ToArray<T>() (which already has a Buffer.MemoryCopy fast path for
contiguous arrays and a coordinate-walk for strided). Foreach over a flat
T[] avoids the per-element Func<T> delegate calls of the legacy iterator.

For large 1-D arrays this allocates an additional T[] equal to the array
size. Consumers of GetEnumerator are typically pretty-printing / format
routines, which already allocate strings proportional to the data, so the
transient extra allocation is not a net regression.

np.broadcast.iters — deferred
-----------------------------
Broadcast.iters is declared `public NDIterator[]` which is part of the
external API surface. Migrating it requires first reworking NDIterator
itself (Step 10) so that AsIterator returns an NpyIter-backed wrapper.
Left unchanged in this commit; the type-name stays but the underlying
implementation gets swapped when Step 10 lands.

Test impact
-----------
Full suite still at 6,748 / 6,748 passing on both net8.0 and net10.0 with
the CI filter (TestCategory!=OpenBugs&TestCategory!=HighMemory).
…gacy files (-3870 LOC)

Step 10 of the legacy iterator migration. With the production call sites
in Phases 1+2 already migrated off of MultiIterator.Assign and the raw
delegate-based AsIterator hot loops, the entire legacy iterator core can
now be removed. What remains is a thin backward-compatibility shim so
that ~86 existing call sites in tests and documentation continue to
compile untouched.

What gets deleted (3,870 net LOC removed)
-----------------------------------------
- src/NumSharp.Core/Backends/Iterators/NDIteratorCasts/
    NDIterator.Cast.Boolean.cs      (254 LOC)
    NDIterator.Cast.Byte.cs         (254 LOC)
    NDIterator.Cast.Char.cs         (254 LOC)
    NDIterator.Cast.Decimal.cs      (254 LOC)
    NDIterator.Cast.Double.cs       (254 LOC)
    NDIterator.Cast.Int16.cs        (254 LOC)
    NDIterator.Cast.Int32.cs        (254 LOC)
    NDIterator.Cast.Int64.cs        (254 LOC)
    NDIterator.Cast.Single.cs       (254 LOC)
    NDIterator.Cast.UInt16.cs       (254 LOC)
    NDIterator.Cast.UInt32.cs       (254 LOC)
    NDIterator.Cast.UInt64.cs       (254 LOC)
    (12 Regen-generated partials dispatching per source dtype)
- src/NumSharp.Core/Backends/Iterators/NDIterator.template.cs  (255 LOC)
    (the Regen source template the Cast files were generated from)
- src/NumSharp.Core/Backends/Iterators/MultiIterator.cs        (313 LOC)
    (now dead — all MultiIterator.Assign call sites migrated in Phase 1)
- src/NumSharp.Core/Backends/Iterators/IteratorType.cs         (9 LOC)
    (the Scalar/Vector/Matrix/Tensor dispatch enum — NpyIter makes it
    redundant since it picks the right traversal automatically)

What stays (with a new, much simpler implementation)
----------------------------------------------------
- NDIterator.cs — reduced from 482 LOC of per-path dispatch
  (AutoReset x Sliced x Scalar/Vector/Matrix/Tensor x NoCast/Cast and
  their full factorial combinations with delegate captures) to 172 LOC
  of a single path:

    1. ctor wraps the src IMemoryBlock as an UnmanagedStorage via
       CreateBroadcastedUnsafe (bypasses the size-match check so broadcast
       shapes with stride=0 axes work).
    2. Materialize allocates a fresh contiguous NDArray with fresh
       C-order strides of TOut dtype and calls NpyIter.Copy — which
       absorbs source layout (contiguous/sliced/strided/transposed),
       broadcast, and any-to-TOut dtype conversion in one SIMD-capable
       pass.
    3. MoveNext/HasNext/Reset/MoveNextReference become trivial pointer
       arithmetic on the materialized buffer.

  The target shape's dimensions are cloned and passed to `new Shape(dims)`
  so the destination has fresh row-major strides and is writeable (the
  input shape may be a read-only broadcast view with stride=0 axes that
  would trip NumSharpException.ThrowIfNotWriteable otherwise).

- INDIterator.cs — removed `IteratorType Type { get; }` member since
  the enum is gone. The rest of the interface (Block, Shape,
  BroadcastedShape, AutoReset, MoveNext<T>, MoveNextReference<T>,
  HasNext, Reset) is preserved. np.Broadcast.iters of type NDIterator[]
  continues to work unchanged.

Trade-off
---------
Iteration now allocates O(size) backing memory up front instead of
walking coordinates lazily. In exchange the per-element hot path is
just `*(ptr + cursor++)` — no delegate dispatch, no ValueCoordinatesIncrementor
arithmetic, no Converts.FindConverter closure capture, no
IteratorType-based switch. For iteration patterns that read all or most
of the elements (which is the common case for .AsIterator users) this
is a net perf win; for patterns that early-exit after reading a handful
of elements, the up-front materialization is wasted work.

MoveNextReference now always returns a reference into the materialized
buffer rather than the source, so callers that used MoveNextReference
as a write-port into the source array will silently write to the local
buffer. This is a behavioral change from the legacy path which supported
MoveNextReference over the original storage in certain non-cast Scalar/
Vector paths. No in-tree caller relied on that behavior (the remaining
test and benchmark usages are all read-only).

Test impact
-----------
Full CI suite still 6,748 / 6,748 passing on net8.0 and net10.0 with
(TestCategory!=OpenBugs&TestCategory!=HighMemory). The NumSharp.Bitmap
and NumSharp.Benchmark projects both build.

Cumulative Phase 1 + Phase 2 impact
-----------------------------------
Twenty-five legacy iterator call sites across eleven files migrated to
NpyIter across six commits. Roughly 4,000 lines of legacy iterator code
deleted, with the same test suite pass count (6,748) maintained at each
step. The remaining AsIterator callers (tests, benchmarks, a couple of
documentation ref comments) no longer pull in any per-dtype Cast switch
or MultiIterator code path — they go through the thin NDIterator
wrapper backed entirely by NpyIter.
…thout EXTERNAL_LOOP

Three symptoms of one bug in NpyIter.Execution.cs. The driver loops —
ForEach, ExecuteGeneric(Single/Multi), and ExecuteReducing — pulled
their per-call count from `GetInnerLoopSizePtr()`, which always returns
`&_state->Shape[NDim - 1]` when the iterator isn't BUFFER'd. In EXLOOP
mode that's correct: `iternext` (via ExternalLoopNext) advances
`IterIndex` by `Shape[NDim - 1]` per call.

But in the default non-EXLOOP non-BUFFER mode, `iternext` (via
StandardNext) only advances by one element per call — `state.Advance()`
increments `IterIndex` by 1. The kernel was still told `count =
Shape[NDim - 1]`, so:

1. The kernel reads `Shape[NDim - 1]` elements starting at the current
   data pointer, which extends past the last valid element of the
   source array.
2. The driver then calls iternext, which advances the pointer by one
   element.
3. The next kernel call reads `Shape[NDim - 1]` elements starting one
   element later — again past the end — and so on.

Net effect: an N-element 1-D array triggers N kernel invocations, each
reading N "elements" (with massive overlap), the last ~N-1 of which
read uninitialized memory. For `np.array([1, 2, NaN, 4, 5])` the
returned NanSum was 46 instead of 12 because the kernel saw the array
plus four trailing garbage floats added together four times over.

Discovered during the Phase 2 migration when wiring the NaN reduction
kernels into NpyIter. Worked around at the call sites by always passing
`NpyIterGlobalFlags.EXTERNAL_LOOP`, which keeps `iternext` and
GetInnerLoopSizePtr in agreement.


This commit fixes the bug at the source so future callers don't need
the workaround. Approach:

- New helper `ResolveInnerLoopCount()` returns the correct count given
  the current flag combination:
    BUFFER:  _state->BufIterEnd
    EXLOOP:  _state->Shape[NDim - 1]
    else:    1
- ForEach, ExecuteGenericSingle, ExecuteGenericMulti, ExecuteReducing
  use ResolveInnerLoopCount instead of dereferencing
  GetInnerLoopSizePtr. BUFFER mode still reads the pointer per
  iteration because buffer fills can shrink at the tail.
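The decision table, as a hedged Python sketch (names mirror the C# flags; the real helper reads iterator state, not parameters):

```python
def resolve_inner_loop_count(buffered, external_loop, shape, buf_iter_end=None):
    # BUFFER: the buffered-copy fill size (can shrink at the tail).
    if buffered:
        return buf_iter_end
    # EXLOOP: one iternext call covers the whole innermost dimension.
    if external_loop:
        return shape[-1]
    # Default StandardNext mode: iternext advances one element per call.
    return 1

assert resolve_inner_loop_count(False, True, [3, 4]) == 4
assert resolve_inner_loop_count(False, False, [3, 4]) == 1
assert resolve_inner_loop_count(True, False, [3, 4], buf_iter_end=2) == 2
```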

Both EXLOOP and non-EXLOOP paths now produce correct results. The
existing Phase 2 call sites keep EXLOOP because it's the SIMD-optimal
mode (one call covers the whole inner dimension), but callers who omit
the flag no longer get silently-wrong output.

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0, plus the
bug-repro smoke test (NanSum over a strided 1-D array without
EXTERNAL_LOOP) now returns the correct sum.
Adds a project-specific CLAUDE.md at examples/NeuralNetwork.NumSharp/.claude/
so future agents working in the example project get the right context
without needing to rediscover everything from the code.

Contents (~280 lines):
  * Build / Run — csproj setup (Exe, net8+net10, AllowUnsafeBlocks),
    InternalsVisibleTo scope, where to drop real MNIST IDX files,
    current demo defaults (epochs=100, batch=128, Adam lr=1e-3,
    synthetic sigma=2.5, eval cadence min(5, epochs)).
  * Directory Map — every file with a one-line purpose.
  * MnistMlp fusion — the three NpyExpr trees that collapse the
    post-matmul element-wise chunks into single NpyIter kernels
    (forward ReLU bias+activation, forward linear bias-only,
    backward ReLU gradient mask).
  * Layer/Cost/Optimizer contract — what every BaseLayer subclass must
    populate (Input/Output/Grads/InputGrad, Parameters["w"/"b"]).
  * Sharp edges — 8 gotchas: historical np.dot strided 100x cliff
    (now fixed by the stride-aware GEMM), 2-index `x[i,j]` vs slice,
    argmax needing axis, np.allclose mutating its arguments via
    astype(copy:false), argmax returning Int64 not Int32, Adam's
    step counter needing monotonic iteration, pre-fix FC weight init,
    slice dtype.
  * Perf characteristics — 100-epoch run numbers, fusion probe, kernel
    cache + delegate-slot instrumentation.
  * Testing — the in-line `dotnet_run` smoke-test pattern.
  * Q&A — why Accuacy/BinaryAccuacy keep the typo, why
    SoftmaxCrossEntropy lives in MnistMlp/ rather than Cost/, when to
    use NeuralNet.Train vs MlpTrainer, real-MNIST expected accuracy.
  * Known limitations — no shuffling, no validation split, Adam
    re-allocates per step, no serialization, string-vs-enum activation
    inconsistency between FullyConnected and FullyConnectedFused.
… copy

Previous rewrite (commit 87f90a2b) backed NDIterator<TOut> with an
eagerly-materialized NDArray buffer — it ran NpyIter.Copy at construction
time into a contiguous TOut-typed buffer and walked that buffer on
MoveNext. Simple, but allocated O(size * sizeof(TOut)) up front even for
callers that read one element and walk away, or abandon iteration early.

This commit drops the materialization. MoveNext now reads each element
lazily from the source layout:

- Same-type, contiguous, offset == 0:
    Direct `*((TOut*)addr + cursor++)`. One pointer increment per call,
    no coordinate arithmetic, no branch. Matches the legacy contiguous
    fast path.

- Same-type, strided / sliced / broadcast / offset != 0:
    Walks offsets with ValueOffsetIncrementor (or
    ValueOffsetIncrementorAutoresetting when AutoReset is set). The
    incrementor updates one coordinate per call amortized O(1), with
    occasional O(ndim) carry-propagation for wrap-around. Same algorithm
    the legacy code used for its Matrix/Tensor sliced paths.

- Cross-type (source dtype != TOut):
    Offset-walks the source at its native dtype, reads a TSrc element,
    and passes it through `Converts.FindConverter<TSrc, TOut>()` before
    returning TOut. One switch at construction dispatches to a typed
    BuildCastingMoveNext<TSrc>() helper — the per-element hot path is
    then a `TSrc v = *(...)` read followed by a `conv(v)` delegate call,
    matching the legacy cast-iterator performance profile. For
    consistency with the legacy path, MoveNextReference throws when a
    cast is involved — you can't hand out a stable ref to a converted
    value.

AutoReset is implemented inline (`if (cursor >= size) cursor = 0` in the
contig path, ValueOffsetIncrementorAutoresetting in the strided path)
rather than via modulo-per-call so the steady-state cost is a single
predictable branch per MoveNext.
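A hedged Python sketch of the strided-path coordinate walk (class name made up; the real ValueOffsetIncrementor also updates the offset incrementally rather than recomputing it):

```python
class OffsetIncrementor:
    # Bump the fastest index each call, carrying into slower dimensions
    # on wrap-around — amortized O(1) index updates, occasional O(ndim) carry.
    def __init__(self, shape, strides, offset=0):
        self.shape, self.strides, self.base = shape, strides, offset
        self.index = [0] * len(shape)
        self.done = False

    def next_offset(self):
        # Offset recomputed here for clarity.
        off = self.base + sum(i * s for i, s in zip(self.index, self.strides))
        for d in reversed(range(len(self.shape))):
            self.index[d] += 1
            if self.index[d] < self.shape[d]:
                break
            self.index[d] = 0  # carry into the next-slower dimension
        else:
            self.done = True   # every dimension wrapped: iteration finished
        return off

# Every second element of a length-6 buffer: shape (3,), stride 2.
it = OffsetIncrementor((3,), (2,))
print([it.next_offset() for _ in range(3)])  # [0, 2, 4]
```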

Memory: iteration now costs O(1) for contig, O(ndim) for the
incrementor's Index[] and internal state on strided. No full-array
allocation regardless of source size.

Test impact: 6,748 / 6,748 passing on net8.0 + net10.0 with the CI
filter (TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke test
covering contig / strided / transposed / cross-type / auto-reset / Reset /
foreach round-trip all match expected element sequences.
Replaces the lazy-but-standalone ValueOffsetIncrementor path with one
that constructs an NpyIter state and drives MoveNext / HasNext / Reset
directly off that state. NDIterator is now an honest thin wrapper
over NpyIter — the same traversal machinery used by all the Phase 2
production call sites — rather than reimplementing the coord-walk
logic with legacy incrementors.

How it works
------------
- ctor calls NpyIterRef.New(arr, NPY_CORDER) to build the state, then
  transfers ownership of the NpyIterState* pointer out of the ref
  struct (see NpyIterRef.ReleaseState / FreeState below). The class
  holds that pointer for its lifetime and frees it in Dispose (or in
  the finalizer as a safety net).
- MoveNext reads `*(TOut*)state->DataPtrs[0]` then calls
  `state->Advance()`. IterIndex tracks position, IterEnd bounds the
  non-AutoReset case, and `state->Reset()` restarts from IterStart on
  AutoReset wraparound and on explicit Reset.
- Cross-dtype wraps the same read with a Converts.FindConverter<TSrc, TOut>
  lookup — one switch at construction picks the typed helper, so the
  per-element hot path is still just one read + one converter delegate
  call. MoveNextReference throws when casting is in play, matching the
  legacy contract.
- NPY_CORDER is explicit so iterating a transposed view yields the
  logical row-major order the old NDIterator provided. Without it,
  KEEPORDER would give memory-efficient order (which e.g.
  `b.T.AsIterator<int>()` would surface as `0 1 2 ... 11` instead of
  the expected `0 4 8 1 5 9 2 6 10 3 7 11`).
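The two orderings for that example, reproduced in plain Python (b stands in for arange(12).reshape(3, 4); b.T is the 4x3 transposed view):

```python
b = [[r * 4 + c for c in range(4)] for r in range(3)]

c_order   = [b[j][i] for i in range(4) for j in range(3)]  # logical b.T order
mem_order = [b[r][c] for r in range(3) for c in range(4)]  # KEEPORDER buffer walk

print(c_order)    # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print(mem_order)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```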

NpyIter additions
-----------------
- NpyIterRef.ReleaseState(): hand the owned NpyIterState* to a caller
  who needs it across a non-ref-struct boundary (e.g. a class field).
  Marks the ref struct as non-owning so its Dispose is a no-op.
- NpyIterRef.FreeState(NpyIterState*): static tear-down mirror of
  Dispose's cleanup path — frees buffers (when BUFFER set), calls
  FreeDimArrays, and NativeMemory.Free's the state pointer. The
  long-lived owner calls this from its own Dispose/finalizer.

Bug fixes along the way
-----------------------
NpyIter initialization previously computed base pointers as
`(byte*)arr.Address + (shape.offset * arr.dtypesize)` in two places
(initial broadcast setup on line 340 and ResetBasePointers on line
1972). `arr.dtypesize` goes through `Marshal.SizeOf(bool) == 4` because
bool is marshaled to win32 BOOL, but the in-memory `bool[]` storage is
1 byte per element. For strided bool arrays this produced a base
pointer 4× too far into the buffer.

Switched both sites to `arr.GetTypeCode.SizeOf()` which returns the
actual in-memory size (1 for bool). Surfaced by `Boolean_Strided_Odd`
once NDIterator started routing through NpyIter — previously only
LATENT because the legacy NDIterator path computed offsets in
element units, not bytes, and sidestepped the NpyIter init.

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0 (CI filter:
TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke test of
same-type contig / cross-type / strided / transposed / broadcast /
AutoReset / Reset / foreach all produce the expected element sequences.
`UnmanagedStorage.DTypeSize` (exposed via `NDArray.dtypesize`) was
delegating to `Marshal.SizeOf(_dtype)`. For every numeric dtype that
matches, but for bool, `Marshal.SizeOf(typeof(bool)) == 4` because bool
is marshaled to win32 BOOL (32-bit). The in-memory layout of `bool[]`
is 1 byte per element, so every caller computing a byte offset as
`ptr + index * arr.dtypesize` was reading/writing 4× too far into the
buffer for bool arrays.

Switches to `_typecode.SizeOf()` which correctly returns 1 for bool and
matches `Marshal.SizeOf` for every other type. 21 existing call sites
(matmul, binary/unary/comparison/reduction ops, nan reductions, std/var,
argmax, random shuffle, boolean mask gather, etc.) now get the right
value without any downstream change.

The bug had been latent until the Phase 2 iterator migration started
routing more code paths through NpyIter.Copy and the new NDIterator
wrapper; it surfaced most visibly as `sliced_bool[mask]` returning the
wrong elements when the source was non-contiguous. With the root fix:

    var full = np.array(new[] { T,F,T,F,T,F,T,F,T });
    var sliced = full["::2"];            // [T,T,T,T,T] non-contig
    var result = sliced[new_bool_mask];  // now correct per-element

np.save.cs already special-cases bool before falling through to
`Marshal.SizeOf`, so serialization was unaffected. Remaining
Marshal.SizeOf references in the codebase are either in comments that
explain this exact issue, or in the `InfoOf<T>.Size` fallback that
only runs for types outside the 12 supported dtypes (e.g. Complex).

Tests: 6,748 / 6,748 passing on net8.0 and net10.0 with the CI filter
(TestCategory!=OpenBugs&TestCategory!=HighMemory).
- Delete 4 NPYITER analysis docs (audit, buffered reduce, deep audit,
  numpy differences) — information consolidated into codebase
- Delete 3 NDIterator.Cast files (Complex, Half, SByte) — casting now
  handled by unified NDIterator<T> backed by NpyIter state
- Update NDIterator.cs: minor adjustments from NpyIter backing refactor
- Update ILKernelGenerator.Scan.cs: scan kernel changes
- Update Default.MatMul.Strided.cs: add INumber<T> constraint support
  for generic matmul dispatch preparation
- Update Default.ClipNDArray.cs: initial NpFunc dispatch refactoring
  replacing 6 switch blocks (~84 cases) with generic dispatch methods
- Update np.full_like.cs: minor fix
- Update RELEASE_0.51.0-prerelease.md release notes
…neric dispatch

NpFunc is a reflection-cached generic dispatch utility that bridges
runtime NPTypeCode values to compile-time generic type parameters.
Hot path (cache hit) runs at ~32ns via Delegate[] array indexed by
NPTypeCode ordinal. Cold path uses MakeGenericMethod + CreateDelegate,
cached after first call per (method, typeCode) pair.

Core NpFunc changes:
- Dynamic table sizing: Delegate[] sized from max NPTypeCode enum value
  (was hardcoded [32], broke for NPTypeCode.Complex=128)
- Overloads for 0-6 args × void/returning × 1-3 NPTypeCodes + 1-2 Types
- SmartMatchTypes for multi-type dispatch (1→broadcast, N=N→positional,
  M<N→type-identity matching)
- Per-arity ConcurrentDictionary caches for multi-type dispatch
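The cold/hot split can be sketched in Python (a toy analogue; the C# version specializes via MakeGenericMethod + CreateDelegate and indexes a Delegate[] by NPTypeCode ordinal):

```python
TYPE_CODES = {"int32": 0, "float64": 1}
table = [None] * len(TYPE_CODES)  # the Delegate[] analogue

def make_add(py_type):
    # Cold path: specialize the generic op for one element type.
    def add(a, b):
        return py_type(a) + py_type(b)
    return add

def dispatch_add(code_name, a, b):
    code = TYPE_CODES[code_name]
    fn = table[code]
    if fn is None:  # cold path, cached after the first call per type code
        fn = table[code] = make_add(int if code_name == "int32" else float)
    return fn(a, b)  # hot path: one array index + one delegate call

print(dispatch_add("int32", 2, 3))    # 5
print(dispatch_add("float64", 2, 3))  # 5.0
```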

Files refactored (12 files, ~400 cases eliminated):

Previous session (5 files, ~196 cases):
- Default.ClipNDArray.cs: 6 dispatch methods for contiguous/general clip
- Default.Clip.cs: 3 dispatch methods for scalar clip with ChangeType
- Default.NonZero.cs: 3 dispatch methods for nonzero/count operations
- Default.BooleanMask.cs: 1 dispatch method for masked copy
- Default.Shift.cs: 2 dispatch methods for array/scalar shift

This session (7 files, ~202 cases):
- NDIteratorExtensions.cs: 5 overloads → 5 dispatch methods creating
  NDIterator<T> from NDArray/UnmanagedStorage/IArraySlice
- Default.Reduction.CumAdd.cs: axis dispatch via CumSumAxisKernel<T>,
  elementwise via IAdditionOperators<T,T,T> with default(T) init
- Default.Reduction.CumMul.cs: axis dispatch via CumProdAxisKernel<T>,
  elementwise via IMultiplyOperators + T.MultiplicativeIdentity init
- np.where.cs: iterator fallback + IL kernel dispatch via pointer cast
- np.random.randint.cs: int/long fill via INumberBase<T>.CreateTruncating
- NDArray.NOT.cs: IEquatable<T>.Equals(default) unifies bool NOT and
  numeric ==0 comparison into single generic method
- Default.LogicalReduction.cs: direct dispatch to ExecuteLogicalAxis<T>

Net: 1,243 lines removed across 12 files, replacing repetitive per-type
switch cases with single generic dispatch methods.
Complex does not implement IComparable<T>, so NpFunc.Invoke into
ClipArrayBoundsDispatch/ClipArrayMinDispatch/ClipArrayMaxDispatch
crashed with ArgumentException on MakeGenericMethod.

Fix: add NPTypeCode.Complex pre-checks in ClipNDArrayContiguous,
ClipNDArrayGeneral, and ClipCore that route to dedicated Complex
clip methods using lexicographic comparison (real first, then imag).
NaN handling preserves the NaN-containing element as-is (not
replaced with NaN+NaN*i), matching NumPy np.maximum/np.minimum
behavior where "NaN wins" but the original value is returned.
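A Python sketch of the lexicographic clip with that NaN rule (helper names made up; the C# version operates on System.Numerics.Complex):

```python
import math

def cplx_less(a, b):
    # Lexicographic order: real part first, then imaginary part.
    return (a.real, a.imag) < (b.real, b.imag)

def has_nan(z):
    return math.isnan(z.real) or math.isnan(z.imag)

def clip_complex(z, lo, hi):
    # "NaN wins" the comparison, but the original NaN-containing
    # element is returned unchanged (not replaced with NaN+NaN*i).
    if has_nan(z):
        return z
    if cplx_less(z, lo):
        return lo
    if cplx_less(hi, z):
        return hi
    return z

print(clip_complex(5 + 1j, 0 + 0j, 3 + 0j))  # (3+0j)
print(clip_complex(complex(float("nan"), 2), 0 + 0j, 3 + 0j))  # NaN kept as-is
```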

Half NaN propagation: ILKernelGenerator.ClipArrayBoundsScalar,
ClipArrayMinScalar, ClipArrayMaxScalar fell through to the generic
CompareTo path for Half, which treats NaN as less-than-all (IEEE
totalOrder) instead of propagating it. Added Half-specific scalar
methods that check Half.IsNaN explicitly before comparison.

Also fix NpFunc table sizing: Delegate[] was hardcoded to [32] but
NPTypeCode.Complex=128 caused IndexOutOfRangeException. Now computed
dynamically from max NPTypeCode enum value at static init.

Fixes 14 test failures (12 Complex clip/maximum/minimum constraint
violations, 2 Half NaN propagation in maximum).